A high-performance evaluation framework for Large Language Models, with a Rust-powered execution engine and a Python user interface.
- High Performance: Rust execution engine for parallel processing and caching
- Extensible: Plugin architecture for models, evaluators, scorers, and reporters
- Rich Reporting: HTML and JSON reports with interactive visualizations
- Multiple Evaluation Types: Multiple choice, freeform, code generation, and more
- Model Agnostic: Support for OpenAI, Anthropic, and local models
- Smart Caching: Intelligent caching to avoid redundant computations
- Benchmark Suite: Built-in benchmarks for math, coding, and reasoning
```bash
# Clone the repository
git clone https://github.com/your-org/forge-evals.git
cd forge-evals

# Build and install
./scripts/build.sh
```

```bash
# Run a benchmark with default models
forge eval benchmarks/math

# Run with specific models
forge eval benchmarks/math --models gpt-4o claude-3-sonnet

# Run with custom configuration
forge run examples/reasoning_eval.yaml

# Generate HTML report
forge eval benchmarks/coding --report html --parallel 8
```
```
anvil/
├── python/forge_evals/       # Python interface
│   ├── models/               # Model adapters
│   ├── evals/                # Evaluation tasks
│   ├── scoring/              # Scoring methods
│   ├── reporting/            # Report generators
│   └── utils/                # Utilities
├── rust/forge_evals_core/    # Rust execution engine
│   ├── engine.rs             # Core evaluation engine
│   ├── executor.rs           # Model execution
│   ├── sandbox.rs            # Code execution sandbox
│   ├── scoring.rs            # High-performance scoring
│   └── cache.rs              # Caching system
├── benchmarks/               # Evaluation benchmarks
│   ├── math/                 # Mathematical reasoning
│   ├── coding/               # Code generation
│   └── reasoning/            # Logical reasoning
├── examples/                 # Example configurations
└── scripts/                  # Build and utility scripts
```
type: "multiple_choice"
config:
options: ["A", "B", "C", "D"]
require_reasoning: truetype: "freeform"
config:
max_length: 1000
require_context: falsetype: "code"
config:
language: "python"
require_tests: true
max_lines: 50models:
- name: "gpt-4o"
provider: "openai"
model_id: "gpt-4o"
api_key: "${OPENAI_API_KEY}"models:
- name: "claude-3-sonnet"
provider: "anthropic"
model_id: "claude-3-sonnet-20240229"
api_key: "${ANTHROPIC_API_KEY}"models:
- name: "llama3-70b"
provider: "local"
model_id: "llama3:70b"
api_config:
command: "ollama"scoring:
method: "exact"
case_sensitive: false
ignore_punctuation: truescoring:
method: "llm_judge"
judge_model: "gpt-4"
threshold: 0.7scoring:
method: "heuristics"
keywords: ["def", "class", "import"]
threshold: 0.6Interactive reports with:
- Visual charts and graphs
- Detailed result tables
- Sample-by-sample analysis
- Performance metrics

Structured data for:

- Programmatic analysis
- Custom visualizations
- Integration with other tools
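For example, a downstream script can load a JSON report and aggregate scores itself. The sketch below is illustrative only; the file name and field names (`results`, `model`, `score`) are assumptions, not the framework's documented schema.

```python
# Illustrative sketch: consuming a JSON report programmatically.
# File name and field names are hypothetical; check a real report for the schema.
import json
from collections import defaultdict

with open("reports/math_eval.json") as f:
    report = json.load(f)

# Group per-sample scores by model name
scores_by_model = defaultdict(list)
for entry in report.get("results", []):
    scores_by_model[entry["model"]].append(entry["score"])

for model, scores in scores_by_model.items():
    print(f"{model}: mean score {sum(scores) / len(scores):.3f} over {len(scores)} samples")
```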
- Getting Started - Installation and first evaluation
- Evaluation Methods - Comprehensive guide to evaluation types and scoring
- Performance Metrics - Understanding latency, throughput, and statistical analysis
- Code Sandboxing - Security features for safe code execution
- API Reference - Python API documentation
- Examples - Sample evaluation configurations
- Automated Confidence Intervals: 95% CI for accuracy and mean scores using NumPy
- Latency Tracking: Per-request execution time with P95 percentiles
- Token Throughput: Automatic calculation of tokens/second
- Statistical Rigor: Wald intervals for categorical data, standard error for continuous
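As a rough illustration of these statistics (not the framework's internal code), a 95% Wald interval for accuracy and a P95 latency can be computed with NumPy on made-up per-sample data like this:

```python
# Illustrative only: 95% Wald CI for accuracy and P95 latency with NumPy.
import numpy as np

correct = np.array([1, 1, 0, 1, 0, 1, 1, 1])            # per-sample pass/fail
latencies_ms = np.array([820, 640, 910, 1300, 760, 590, 1010, 880])

# Wald interval for a proportion: p +/- 1.96 * sqrt(p * (1 - p) / n)
p = correct.mean()
se = np.sqrt(p * (1 - p) / correct.size)
ci_low, ci_high = p - 1.96 * se, p + 1.96 * se

# P95 latency percentile
p95 = np.percentile(latencies_ms, 95)

print(f"accuracy {p:.2f} (95% CI [{ci_low:.2f}, {ci_high:.2f}]), P95 latency {p95:.0f} ms")
```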
- macOS Sandboxing: Native Seatbelt (sandbox-exec) integration
- Linux Sandboxing: Bubblewrap (bwrap) namespace isolation
- Network Isolation: Complete network blocking for untrusted code
- Filesystem Protection: Read-only system access, writable /tmp only
- Parallel Execution: Task scheduling that can be delegated to the Rust core
- FFI Integration: Seamless Python-Rust interop via PyO3
- Async Scoring: Non-blocking evaluation pipeline
- Intelligent Caching: Redis-compatible result caching
- Stats Engine: High-performance statistics with 95% Confidence Intervals using NumPy
- CLI: Command-line interface for easy usage
- Models: Adapters for different model providers
- Evaluators: Task-specific evaluation logic
- Scorers: Flexible scoring algorithms (Async supported)
- Reporters: Multiple output formats
- Stats Engine: High-performance statistics with 95% Confidence Intervals using NumPy
- Engine: Core evaluation orchestration, which Python can delegate to via FFI
- Executor: High-performance model execution
- Sandbox: Secure code execution environment using macOS Seatbelt (sandbox-exec)
- Cache: Intelligent caching system
- Scoring: Optimized scoring algorithms
To use the high-performance Rust execution engine instead of the Python runner, set `use_rust_engine: true` in your configuration or use the `--rust` flag (coming soon to the CLI). This delegates task scheduling and parallelization to the Rust core for maximum throughput and lower overhead.
Coding evaluations are executed in a secure sandbox. On macOS, this uses the Seatbelt (sandbox-exec) facility to restrict network access and file system writes, ensuring that LLM-generated code cannot compromise your host system.
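As a rough sketch of the kind of wrapping involved (Seatbelt via `sandbox-exec` on macOS, Bubblewrap's `bwrap` on Linux, per the features above), and not the framework's actual sandbox module:

```python
# Illustrative sketch only, not the framework's sandbox implementation:
# wrapping an untrusted command with sandbox-exec (macOS) or bwrap (Linux).
import platform
import subprocess

# Simplified, illustrative Seatbelt profile: no network, writes only under /tmp.
SEATBELT_PROFILE = """
(version 1)
(allow default)
(deny network*)
(deny file-write*)
(allow file-write* (subpath "/tmp"))
"""

def run_sandboxed(cmd: list[str]) -> subprocess.CompletedProcess:
    """Run cmd with network access blocked and writes restricted to /tmp."""
    if platform.system() == "Darwin":
        # macOS: Seatbelt via sandbox-exec with an inline profile
        wrapped = ["sandbox-exec", "-p", SEATBELT_PROFILE] + cmd
    else:
        # Linux: Bubblewrap namespace isolation, read-only root, private /tmp
        wrapped = [
            "bwrap", "--ro-bind", "/", "/", "--tmpfs", "/tmp",
            "--unshare-net", "--die-with-parent",
        ] + cmd
    return subprocess.run(wrapped, capture_output=True, text=True, timeout=30)

if __name__ == "__main__":
    # Hypothetical path to LLM-generated code under evaluation.
    result = run_sandboxed(["python3", "/tmp/generated_solution.py"])
    print(result.returncode, result.stdout[:200])
```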
```bash
# Install development dependencies
pip install -e ".[dev]"

# Install Rust toolchain
rustup update stable

# Build in development mode
maturin develop
```

```bash
# Python tests
python -m pytest tests/python/ -v

# Rust tests
cargo test --manifest-path rust/forge_evals_core/Cargo.toml

# Integration tests
python -m pytest tests/ -v
```
```python
# python/forge_evals/models/my_model.py
from .base import BaseModel

class MyModel(BaseModel):
    async def generate(self, messages, **kwargs):
        # Your implementation: call your provider's API with `messages`
        # and return the generated completion.
        pass
```
```python
# python/forge_evals/evals/my_eval.py
from .base import BaseEval

class MyEval(BaseEval):
    def prepare_messages(self, sample):
        # Your implementation: turn a dataset sample into the
        # message list sent to the model.
        pass
```
```python
# python/forge_evals/scoring/my_scorer.py
from .base import BaseScorer

class MyScorer(BaseScorer):
    def score(self, sample, result):
        # Your implementation: compare the model `result` against
        # the expected answer in `sample` and return a score.
        pass
```

```bash
forge eval benchmarks/math --models gpt-4o --parallel 4
```
name: "Custom Evaluation"
type: "freeform"
dataset:
file: "my_data.jsonl"
models:
- name: "gpt-4o"
provider: "openai"
model_id: "gpt-4o"
scoring:
method: "exact"forge run my_eval.yamlfrom forge_evals import EvalRunner, EvalConfig
config = EvalConfig(
name="My Evaluation",
dataset=DatasetConfig(path="data.jsonl"),
models=[ModelConfig(name="gpt-4o", provider="openai", model_id="gpt-4o")],
scoring=ScoringConfig(method="exact")
)
runner = EvalRunner(config)
results = await runner.run()- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Inspired by existing evaluation frameworks
- Built with Rust for performance and Python for usability
- Community contributions and feedback
- Documentation
- Issues
- Discussions
Made with ❤️ by the Forge Evals Team