A high-performance benchmarking tool for LLM endpoints.
Requirements: Python 3.12+. Python 3.12 is recommended for optimal performance; free-threaded (GIL-less) builds of newer Python versions are not yet supported.
# Clone the repository
# Note: This repo will be migrated to https://github.com/mlcommons/endpoints
git clone https://github.com/mlcommons/endpoints.git
cd endpoints
# Create virtual environment
python3.12 -m venv venv
source venv/bin/activate
# As a user
pip install .
# As a developer (with development and test extras)
pip install -e ".[dev,test]"
pre-commit install
# Show help
inference-endpoint --help
# Show system information
inference-endpoint -v info
# Test endpoint connectivity
inference-endpoint probe \
--endpoints http://your-endpoint:8000 \
--model Qwen/Qwen3-8B
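The probe command's internals aren't shown here, but for an OpenAI-compatible server (such as vLLM) a basic connectivity check can be approximated with a plain HTTP request to the `/v1/models` route. This is a sketch under that assumption; `probe_models` and `extract_model_ids` are hypothetical helpers, not part of this tool:

```python
import json
import urllib.request


def extract_model_ids(payload: dict) -> list[str]:
    # Parse the "data" list of an OpenAI-style /v1/models response.
    return [m["id"] for m in payload.get("data", [])]


def probe_models(base_url: str, timeout: float = 5.0) -> list[str]:
    # GET /v1/models (exposed by vLLM and most OpenAI-compatible
    # servers) and return the ids of the models being served.
    url = base_url.rstrip("/") + "/v1/models"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return extract_model_ids(json.load(resp))
```

If the target model's id appears in the returned list, the endpoint is reachable and serving the expected model.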
# Run offline benchmark (max throughput - uses all dataset samples)
inference-endpoint benchmark offline \
--endpoints http://your-endpoint:8000 \
--model Qwen/Qwen3-8B \
--dataset tests/datasets/dummy_1k.jsonl
# Run online benchmark (sustained QPS - requires --target-qps, --load-pattern)
inference-endpoint benchmark online \
--endpoints http://your-endpoint:8000 \
--model Qwen/Qwen3-8B \
--dataset tests/datasets/dummy_1k.jsonl \
--load-pattern poisson \
--target-qps 100
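With `--load-pattern poisson`, request arrivals follow a Poisson process: inter-arrival gaps are exponentially distributed with mean `1 / target_qps`. A minimal sketch of how such a schedule can be generated (the function name and signature are illustrative, not this tool's API):

```python
import random


def poisson_arrival_times(target_qps: float, num_requests: int,
                          seed: int = 0) -> list[float]:
    # Send times (in seconds from start) for a Poisson load pattern:
    # each gap is drawn from an exponential with rate target_qps.
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(num_requests):
        t += rng.expovariate(target_qps)
        times.append(t)
    return times
```

Unlike a fixed-interval schedule, this produces the bursty arrivals typical of real traffic while averaging the requested QPS over time.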
# With explicit sample count
inference-endpoint benchmark offline \
--endpoints http://your-endpoint:8000 \
--model Qwen/Qwen3-8B \
--dataset tests/datasets/dummy_1k.jsonl \
--num-samples 5000
# Start local echo server
python -m inference_endpoint.testing.echo_server --port 8765 &
# Test with dummy dataset (included in repo)
inference-endpoint benchmark offline \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
--dataset tests/datasets/dummy_1k.jsonl
# Stop echo server
pkill -f echo_server
See the Local Testing Guide for detailed instructions.
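Conceptually, an echo server for this kind of testing just reflects the prompt back in an OpenAI-style completion payload, so no GPU or model is needed. The following is a sketch of that idea, not the repo's actual `echo_server` implementation, which may differ:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def echo_completion(body: dict) -> dict:
    # Build a minimal OpenAI-style chat completion that echoes
    # the content of the last message back to the caller.
    last = body.get("messages", [{}])[-1].get("content", "")
    return {
        "object": "chat.completion",
        "model": body.get("model", "echo"),
        "choices": [
            {"index": 0, "message": {"role": "assistant", "content": last}}
        ],
    }


class EchoHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        reply = json.dumps(echo_completion(body)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8765), EchoHandler).serve_forever()
```

Because responses are deterministic and instant, the echo server exercises the benchmarking pipeline itself rather than any model.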
# Install test dependencies
pip install ".[test]"
# Run tests (excluding performance and explicit-run tests)
pytest -m "not performance and not run_explicitly"
# Run examples: follow instructions in examples/*/README.md
- CLI Quick Reference - Command-line interface guide
- Local Testing Guide - Test with echo server
- Development Guide - How to contribute and develop
- GitHub Setup Guide - GitHub authentication and setup
The system follows a modular, event-driven architecture:
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│     Dataset     │    │      Load       │    │    Endpoint     │
│     Manager     │───▶│    Generator    │───▶│     Client      │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│     Metrics     │    │  Configuration  │    │    Endpoint     │
│    Collector    │◄───│     Manager     │    │   (External)    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
- Load Generator: Central orchestrator managing query lifecycle
- Dataset Manager: Handles benchmark datasets and preprocessing
- Endpoint Client: Abstract interface for endpoint communication
- Metrics Collector: Performance measurement and analysis
- Configuration Manager: System configuration (TBD)
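To make the division of responsibilities concrete, here is a heavily simplified sketch of how these components could fit together in offline mode. All names (`Query`, `Result`, `EndpointClient`, `MetricsCollector`, `run_offline`) are illustrative assumptions, not this project's actual classes:

```python
from dataclasses import dataclass, field
from typing import Iterable, Protocol


@dataclass
class Query:
    prompt: str


@dataclass
class Result:
    latency_s: float
    output_tokens: int


class EndpointClient(Protocol):
    # Abstract interface each backend (HTTP, gRPC, ...) implements.
    def send(self, query: Query) -> Result: ...


@dataclass
class MetricsCollector:
    results: list[Result] = field(default_factory=list)

    def record(self, result: Result) -> None:
        self.results.append(result)


def run_offline(queries: Iterable[Query], client: EndpointClient,
                metrics: MetricsCollector) -> None:
    # Offline mode: issue every dataset sample as fast as possible,
    # recording each result as it completes.
    for q in queries:
        metrics.record(client.send(q))
```

In the real system the Load Generator would also handle concurrency and the online load patterns; this sketch only shows the data flow from dataset to metrics.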
You can run accuracy evaluation with Pass@1 scoring by specifying accuracy datasets in the benchmark configuration. Currently, Inference Endpoints provides the following pre-defined accuracy benchmarks:
- GPQA (default: GPQA Diamond)
- AIME (default: AIME 2025)
- LiveCodeBench (default: lite, release_v6)
Note that LiveCodeBench does not work out of the box and requires additional setup; see the LiveCodeBench documentation for details.
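The exact scoring code isn't shown here, but the standard unbiased pass@k estimator (Chen et al., 2021) reduces to the fraction of correct samples when k = 1. A sketch, assuming n total samples of which c pass:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k).
    # For k = 1 this simplifies to c / n.
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample draw
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With a single generation per problem (n = 1), Pass@1 is simply 1.0 for a correct answer and 0.0 otherwise, averaged over the dataset.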
The following features are planned for future releases:
- Performance Tuning - Advanced performance optimization features
- Submission Ruleset Integration - Full MLPerf submission workflow support
- Documentation Generation and Hosting - Sphinx-based API documentation with GitHub Pages
We welcome contributions! Please see our Development Guide for details on:
- Setting up your development environment
- Code style and quality standards
- Testing requirements
- Pull request process
This project draws inspiration from the following excellent projects:
- MLCommons Inference - MLPerf Inference benchmark suite
- AIPerf - AI model performance profiling framework
- SGLang GenAI-Bench - Token-level performance evaluation tool
- vLLM Benchmarks - Performance benchmarking tools for vLLM
- InferenceMAX - LLM inference optimization toolkit
We are grateful to these communities for their contributions to LLM benchmarking and performance analysis.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- MLCommons - Machine Learning Performance Standards
- Project Repository
- MLPerf Inference
Credits to core contributors of the project:
- MLCommons Committee
- NVIDIA: Zhihan Jiang, Rashid Kaleem, Viraat Chandra, Alice Cheng
- ...
See ATTRIBUTION for detailed attribution information.
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: See docs/ directory for guides