MLPerf® Inference Endpoint Benchmarking System

A high-performance benchmarking tool for LLM endpoints.

Quick Start

Installation

Requirements: Python 3.12 or newer. Python 3.12 is recommended for best performance; free-threaded (GIL-less) builds of newer Python versions are not yet supported.

# Clone the repository
# Note: This repo will be migrated to https://github.com/mlcommons/endpoints
git clone https://github.com/mlcommons/endpoints.git
cd endpoints

# Create virtual environment
python3.12 -m venv venv
source venv/bin/activate

# As a user
pip install .

# As a developer (with development and test extras)
pip install -e ".[dev,test]"
pre-commit install

Basic Usage

# Show help
inference-endpoint --help

# Show system information
inference-endpoint -v info

# Test endpoint connectivity
inference-endpoint probe \
  --endpoints http://your-endpoint:8000 \
  --model Qwen/Qwen3-8B

# Run offline benchmark (max throughput - uses all dataset samples)
inference-endpoint benchmark offline \
  --endpoints http://your-endpoint:8000 \
  --model Qwen/Qwen3-8B \
  --dataset tests/datasets/dummy_1k.jsonl

# Run online benchmark (sustained QPS - requires --target-qps, --load-pattern)
inference-endpoint benchmark online \
  --endpoints http://your-endpoint:8000 \
  --model Qwen/Qwen3-8B \
  --dataset tests/datasets/dummy_1k.jsonl \
  --load-pattern poisson \
  --target-qps 100

# With explicit sample count
inference-endpoint benchmark offline \
  --endpoints http://your-endpoint:8000 \
  --model Qwen/Qwen3-8B \
  --dataset tests/datasets/dummy_1k.jsonl \
  --num-samples 5000
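The `--load-pattern poisson` option above schedules requests with exponentially distributed inter-arrival times, which is the standard way to model a sustained target QPS. As a rough, hedged sketch (not the actual `inference_endpoint` implementation), arrival times could be generated like this:

```python
# Illustrative only: a common way to generate a Poisson load pattern.
# Inter-arrival gaps are drawn from an exponential distribution whose
# rate equals the target QPS, so the mean gap is 1 / target_qps.
import random

def poisson_arrival_times(target_qps, num_requests, seed=0):
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(num_requests):
        t += rng.expovariate(target_qps)  # mean gap = 1 / target_qps
        times.append(t)
    return times

times = poisson_arrival_times(target_qps=100, num_requests=10_000)
mean_gap = times[-1] / len(times)
print(mean_gap)  # close to 0.01 s at 100 QPS
```

With enough samples the observed mean gap converges to `1 / target_qps`, while individual gaps remain bursty, which stresses the endpoint more realistically than a fixed-interval schedule.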

Running Locally

# Start local echo server
python -m inference_endpoint.testing.echo_server --port 8765 &

# Test with dummy dataset (included in repo)
inference-endpoint benchmark offline \
  --endpoints http://localhost:8765 \
  --model Qwen/Qwen3-8B \
  --dataset tests/datasets/dummy_1k.jsonl

# Stop echo server
pkill -f echo_server

See Local Testing Guide for detailed instructions.

Running Tests and Examples

# Install test dependencies
pip install ".[test]"

# Run tests (excluding performance and explicit-run tests)
pytest -m "not performance and not run_explicitly"

# Run examples: follow instructions in examples/*/README.md

📚 Documentation

🎯 Architecture

The system follows a modular, event-driven architecture:

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Dataset       │    │   Load          │    │   Endpoint      │
│   Manager       │───▶│   Generator     │───▶│   Client        │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Metrics       │    │   Configuration │    │   Endpoint      │
│   Collector     │◄───│   Manager       │    │   (External)    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
  • Load Generator: Central orchestrator managing query lifecycle
  • Dataset Manager: Handles benchmark datasets and preprocessing
  • Endpoint Client: Abstract interface for endpoint communication
  • Metrics Collector: Performance measurement and analysis
  • Configuration Manager: System configuration (TBD)
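The data flow above can be sketched in a few lines of asyncio. This is a hypothetical illustration of the component roles, not the actual `inference_endpoint` API; all names here are invented:

```python
# Hypothetical sketch of the architecture's data flow; component names
# in comments map to the diagram, but the code is illustrative only.
import asyncio

async def send_query(sample):
    # Endpoint Client (stubbed): would issue an HTTP request to the endpoint.
    await asyncio.sleep(0.001)
    return {"prompt": sample, "latency_ms": 1.0}

async def run_benchmark(dataset):
    results = []  # Metrics Collector: accumulates per-query measurements

    async def one(sample):
        results.append(await send_query(sample))

    # Load Generator: orchestrates the query lifecycle; in offline mode,
    # all samples are dispatched concurrently for maximum throughput.
    await asyncio.gather(*(one(s) for s in dataset))
    return results

dataset = [f"prompt-{i}" for i in range(4)]  # Dataset Manager (stubbed)
out = asyncio.run(run_benchmark(dataset))
print(len(out))  # 4
```

An online run would differ only in the Load Generator: instead of dispatching everything at once, it would sleep between dispatches according to the configured load pattern and target QPS.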

Accuracy Evaluation

You can run accuracy evaluation with Pass@1 scoring by specifying accuracy datasets in the benchmark configuration. Currently, Inference Endpoints provides the following pre-defined accuracy benchmarks:

  • GPQA (default: GPQA Diamond)
  • AIME (default: AIME 2025)
  • LiveCodeBench (default: lite, release_v6)

However, LiveCodeBench does not work out of the box and requires additional setup; see the LiveCodeBench documentation for details.
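For context, Pass@1 with a single generation per problem reduces to the fraction of problems answered correctly on the first (and only) attempt. A minimal sketch, assuming one boolean correctness result per problem:

```python
# Hedged sketch of Pass@1 scoring: with one generation per problem,
# Pass@1 is simply the fraction of problems whose answer is correct.
def pass_at_1(results):
    """results: list of booleans, one per problem (True = correct)."""
    return sum(results) / len(results) if results else 0.0

print(pass_at_1([True, False, True, True]))  # 0.75
```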

🚧 Pending Features

The following features are planned for future releases:

  • Performance Tuning - Advanced performance optimization features
  • Submission Ruleset Integration - Full MLPerf submission workflow support
  • Documentation Generation and Hosting - Sphinx-based API documentation with GitHub Pages

🤝 Contributing

We welcome contributions! Please see our Development Guide for details on:

  • Setting up your development environment
  • Code style and quality standards
  • Testing requirements
  • Pull request process

🙏 Acknowledgements

This project draws inspiration from the following excellent projects:

We are grateful to these communities for their contributions to LLM benchmarking and performance analysis.

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

🔗 Links

👥 Contributors

Core contributors to the project:

  • MLCommons Committee
  • NVIDIA: Zhihan Jiang, Rashid Kaleem, Viraat Chandra, Alice Cheng
  • ...

See ATTRIBUTION for detailed attribution information.

📞 Support
