A high-performance benchmarking tool for LLM endpoints.
Requirements: Python 3.12+. Python 3.12 is recommended for optimal performance; free-threaded (GIL-less) builds of newer Python versions are not yet supported.
# Clone the repository
# Note: This repo will be migrated to https://github.com/mlcommons/endpoints
git clone https://github.com/mlcommons/endpoints.git
cd endpoints
# Create virtual environment
python3.12 -m venv venv
source venv/bin/activate
# As a user
pip install .
# As a developer (with development and test extras)
pip install -e ".[dev,test]"
pre-commit install
# Show help
inference-endpoint --help
# Show system information
inference-endpoint -v info
# Test endpoint connectivity
inference-endpoint probe \
--endpoints http://your-endpoint:8000 \
--model Qwen/Qwen3-8B
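The probe command's internals aren't shown here, but for an OpenAI-compatible server (such as vLLM) a basic connectivity check can be approximated with a plain HTTP request to the `/v1/models` route. This is a sketch under that assumption; `probe_models` and `extract_model_ids` are hypothetical helpers, not part of this tool:

```python
import json
import urllib.request


def extract_model_ids(payload: dict) -> list[str]:
    # Parse the "data" list of an OpenAI-style /v1/models response.
    return [m["id"] for m in payload.get("data", [])]


def probe_models(base_url: str, timeout: float = 5.0) -> list[str]:
    # GET /v1/models (exposed by vLLM and most OpenAI-compatible
    # servers) and return the ids of the models being served.
    url = base_url.rstrip("/") + "/v1/models"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return extract_model_ids(json.load(resp))
```

If the target model's id appears in the returned list, the endpoint is reachable and serving the expected model.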
# Run offline benchmark (max throughput - uses all dataset samples)
inference-endpoint benchmark offline \
--endpoints http://your-endpoint:8000 \
--model Qwen/Qwen3-8B \
--dataset tests/datasets/dummy_1k.jsonl
# Run online benchmark (sustained QPS - requires --target-qps, --load-pattern)
inference-endpoint benchmark online \
--endpoints http://your-endpoint:8000 \
--model Qwen/Qwen3-8B \
--dataset tests/datasets/dummy_1k.jsonl \
--load-pattern poisson \
--target-qps 100
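With `--load-pattern poisson`, request arrivals follow a Poisson process: inter-arrival gaps are exponentially distributed with mean `1 / target_qps`. A minimal sketch of how such a schedule can be generated (the function name and signature are illustrative, not this tool's API):

```python
import random


def poisson_arrival_times(target_qps: float, num_requests: int,
                          seed: int = 0) -> list[float]:
    # Send times (in seconds from start) for a Poisson load pattern:
    # each gap is drawn from an exponential with rate target_qps.
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(num_requests):
        t += rng.expovariate(target_qps)
        times.append(t)
    return times
```

Unlike a fixed-interval schedule, this produces the bursty arrivals typical of real traffic while averaging the requested QPS over time.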
# With explicit sample count
inference-endpoint benchmark offline \
--endpoints http://your-endpoint:8000 \
--model Qwen/Qwen3-8B \
--dataset tests/datasets/dummy_1k.jsonl \
--num-samples 5000
# Start local echo server
python -m inference_endpoint.testing.echo_server --port 8765 &
# Test with dummy dataset (included in repo)
inference-endpoint benchmark offline \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
--dataset tests/datasets/dummy_1k.jsonl
# Stop echo server
pkill -f echo_server
See the Local Testing Guide for detailed instructions.
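Conceptually, an echo server for this kind of testing just reflects the prompt back in an OpenAI-style completion payload, so no GPU or model is needed. The following is a sketch of that idea, not the repo's actual `echo_server` implementation, which may differ:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def echo_completion(body: dict) -> dict:
    # Build a minimal OpenAI-style chat completion that echoes
    # the content of the last message back to the caller.
    last = body.get("messages", [{}])[-1].get("content", "")
    return {
        "object": "chat.completion",
        "model": body.get("model", "echo"),
        "choices": [
            {"index": 0, "message": {"role": "assistant", "content": last}}
        ],
    }


class EchoHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        reply = json.dumps(echo_completion(body)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8765), EchoHandler).serve_forever()
```

Because responses are deterministic and instant, the echo server exercises the benchmarking pipeline itself rather than any model.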
# Install test dependencies
pip install ".[test]"
# Run tests (excluding performance and explicit-run tests)
pytest -m "not performance and not run_explicitly"
# Run examples: follow instructions in examples/*/README.md
- CLI Quick Reference - Command-line interface guide
- Local Testing Guide - Test with echo server
- Development Guide - How to contribute and develop
- GitHub Setup Guide - GitHub authentication and setup
The system follows a modular, event-driven architecture:
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│     Dataset     │    │      Load       │    │    Endpoint     │
│     Manager     │───▶│    Generator    │───▶│     Client      │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│     Metrics     │    │  Configuration  │    │    Endpoint     │
│    Collector    │◄───│     Manager     │    │   (External)    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
- Load Generator: Central orchestrator managing query lifecycle
- Dataset Manager: Handles benchmark datasets and preprocessing
- Endpoint Client: Abstract interface for endpoint communication
- Metrics Collector: Performance measurement and analysis
- Configuration Manager: System configuration (TBD)
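To make the division of responsibilities concrete, here is a heavily simplified sketch of how these components could fit together in offline mode. All names (`Query`, `Result`, `EndpointClient`, `MetricsCollector`, `run_offline`) are illustrative assumptions, not this project's actual classes:

```python
from dataclasses import dataclass, field
from typing import Iterable, Protocol


@dataclass
class Query:
    prompt: str


@dataclass
class Result:
    latency_s: float
    output_tokens: int


class EndpointClient(Protocol):
    # Abstract interface each backend (HTTP, gRPC, ...) implements.
    def send(self, query: Query) -> Result: ...


@dataclass
class MetricsCollector:
    results: list[Result] = field(default_factory=list)

    def record(self, result: Result) -> None:
        self.results.append(result)


def run_offline(queries: Iterable[Query], client: EndpointClient,
                metrics: MetricsCollector) -> None:
    # Offline mode: issue every dataset sample as fast as possible,
    # recording each result as it completes.
    for q in queries:
        metrics.record(client.send(q))
```

In the real system the Load Generator would also handle concurrency and the online load patterns; this sketch only shows the data flow from dataset to metrics.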
You can run accuracy evaluation with Pass@1 scoring by specifying accuracy datasets in the benchmark configuration. Currently, Inference Endpoints provides the following pre-defined accuracy benchmarks:
- GPQA (default: GPQA Diamond)
- AIME (default: AIME 2025)
- LiveCodeBench (default: lite, release_v6)
Note that LiveCodeBench does not work out of the box and requires additional setup; see the LiveCodeBench documentation for details.
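The exact scoring code isn't shown here, but the standard unbiased pass@k estimator (Chen et al., 2021) reduces to the fraction of correct samples when k = 1. A sketch, assuming n total samples of which c pass:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k).
    # For k = 1 this simplifies to c / n.
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample draw
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With a single generation per problem (n = 1), Pass@1 is simply 1.0 for a correct answer and 0.0 otherwise, averaged over the dataset.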
The following features are planned for future releases:
- Performance Tuning - Advanced performance optimization features
- Submission Ruleset Integration - Full MLPerf submission workflow support
- Documentation Generation and Hosting - Sphinx-based API documentation with GitHub Pages
We welcome contributions! Please see our Development Guide for details on:
- Setting up your development environment
- Code style and quality standards
- Testing requirements
- Pull request process
This project draws inspiration from the following excellent projects:
- MLCommons Inference - MLPerf Inference benchmark suite
- AIPerf - AI model performance profiling framework
- SGLang GenAI-Bench - Token-level performance evaluation tool
- vLLM Benchmarks - Performance benchmarking tools for vLLM
- InferenceMAX - LLM inference optimization toolkit
We are grateful to these communities for their contributions to LLM benchmarking and performance analysis.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- MLCommons - Machine Learning Performance Standards
- Project Repository
- MLPerf Inference
Credits to core contributors of the project:
- MLCommons Committee
- NVIDIA: Zhihan Jiang, Rashid Kaleem, Viraat Chandra, Alice Cheng
- ...
See ATTRIBUTION for detailed attribution information.
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: See docs/ directory for guides