A high-performance evaluation framework for Large Language Models, with a Rust-powered execution engine and a Python user interface.
- High Performance: Rust execution engine for parallel processing and caching
- Extensible: Plugin architecture for models, evaluators, scorers, and reporters
- Rich Reporting: HTML and JSON reports with interactive visualizations
- Multiple Evaluation Types: Multiple choice, freeform, code generation, and more
- Model Agnostic: Support for OpenAI, Anthropic, and local models
- Smart Caching: Intelligent caching to avoid redundant computations
- Benchmark Suite: Built-in benchmarks for math, coding, and reasoning
```bash
# Clone the repository
git clone https://github.com/your-org/forge-evals.git
cd forge-evals

# Build and install
./scripts/build.sh
```

```bash
# Run a benchmark with default models
forge eval benchmarks/math

# Run with specific models
forge eval benchmarks/math --models gpt-4o claude-3-sonnet

# Run with custom configuration
forge run examples/reasoning_eval.yaml

# Generate HTML report
forge eval benchmarks/coding --report html --parallel 8
```
```
anvil/
├── python/forge_evals/       # Python interface
│   ├── models/               # Model adapters
│   ├── evals/                # Evaluation tasks
│   ├── scoring/              # Scoring methods
│   ├── reporting/            # Report generators
│   └── utils/                # Utilities
├── rust/forge_evals_core/    # Rust execution engine
│   ├── engine.rs             # Core evaluation engine
│   ├── executor.rs           # Model execution
│   ├── sandbox.rs            # Code execution sandbox
│   ├── scoring.rs            # High-performance scoring
│   └── cache.rs              # Caching system
├── benchmarks/               # Evaluation benchmarks
│   ├── math/                 # Mathematical reasoning
│   ├── coding/               # Code generation
│   └── reasoning/            # Logical reasoning
├── examples/                 # Example configurations
└── scripts/                  # Build and utility scripts
```
type: "multiple_choice"
config:
options: ["A", "B", "C", "D"]
require_reasoning: truetype: "freeform"
config:
max_length: 1000
require_context: falsetype: "code"
config:
language: "python"
require_tests: true
max_lines: 50models:
- name: "gpt-4o"
provider: "openai"
model_id: "gpt-4o"
api_key: "${OPENAI_API_KEY}"models:
- name: "claude-3-sonnet"
provider: "anthropic"
model_id: "claude-3-sonnet-20240229"
api_key: "${ANTHROPIC_API_KEY}"models:
- name: "llama3-70b"
provider: "local"
model_id: "llama3:70b"
api_config:
command: "ollama"scoring:
method: "exact"
case_sensitive: false
ignore_punctuation: truescoring:
method: "llm_judge"
judge_model: "gpt-4"
threshold: 0.7scoring:
method: "heuristics"
keywords: ["def", "class", "import"]
threshold: 0.6Interactive reports with:
- Visual charts and graphs
- Detailed result tables
- Sample-by-sample analysis
- Performance metrics

Structured data for:

- Programmatic analysis
- Custom visualizations
- Integration with other tools
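For example, a downstream script can load a JSON report and aggregate scores itself. The sketch below is illustrative only; the file name and field names (`results`, `model`, `score`) are assumptions, not the framework's documented schema.

```python
# Illustrative sketch: consuming a JSON report programmatically.
# File name and field names are hypothetical; check a real report for the schema.
import json
from collections import defaultdict

with open("reports/math_eval.json") as f:
    report = json.load(f)

# Group per-sample scores by model name
scores_by_model = defaultdict(list)
for entry in report.get("results", []):
    scores_by_model[entry["model"]].append(entry["score"])

for model, scores in scores_by_model.items():
    print(f"{model}: mean score {sum(scores) / len(scores):.3f} over {len(scores)} samples")
```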
- Getting Started - Installation and first evaluation
- Evaluation Methods - Comprehensive guide to evaluation types and scoring
- Performance Metrics - Understanding latency, throughput, and statistical analysis
- Code Sandboxing - Security features for safe code execution
- API Reference - Python API documentation
- Examples - Sample evaluation configurations
- Automated Confidence Intervals: 95% CI for accuracy and mean scores using NumPy
- Latency Tracking: Per-request execution time with P95 percentiles
- Token Throughput: Automatic calculation of tokens/second
- Statistical Rigor: Wald intervals for categorical data, standard error for continuous
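As a rough illustration of these statistics (not the framework's internal code), a 95% Wald interval for accuracy and a P95 latency can be computed with NumPy on made-up per-sample data like this:

```python
# Illustrative only: 95% Wald CI for accuracy and P95 latency with NumPy.
import numpy as np

correct = np.array([1, 1, 0, 1, 0, 1, 1, 1])            # per-sample pass/fail
latencies_ms = np.array([820, 640, 910, 1300, 760, 590, 1010, 880])

# Wald interval for a proportion: p +/- 1.96 * sqrt(p * (1 - p) / n)
p = correct.mean()
se = np.sqrt(p * (1 - p) / correct.size)
ci_low, ci_high = p - 1.96 * se, p + 1.96 * se

# P95 latency percentile
p95 = np.percentile(latencies_ms, 95)

print(f"accuracy {p:.2f} (95% CI [{ci_low:.2f}, {ci_high:.2f}]), P95 latency {p95:.0f} ms")
```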
- macOS Sandboxing: Native Seatbelt (sandbox-exec) integration
- Linux Sandboxing: Bubblewrap (bwrap) namespace isolation
- Network Isolation: Complete network blocking for untrusted code
- Filesystem Protection: Read-only system access, writable /tmp only
- Parallel Execution: Task scheduling that can be delegated to the Rust core
- FFI Integration: Seamless Python-Rust interop via PyO3
- Async Scoring: Non-blocking evaluation pipeline
- Intelligent Caching: Redis-compatible result caching
- Stats Engine: High-performance statistics with 95% Confidence Intervals using NumPy
- CLI: Command-line interface for easy usage
- Models: Adapters for different model providers
- Evaluators: Task-specific evaluation logic
- Scorers: Flexible scoring algorithms (Async supported)
- Reporters: Multiple output formats
- Stats Engine: High-performance statistics with 95% Confidence Intervals using NumPy
- Engine: Core evaluation orchestration, which Python can delegate to via FFI
- Executor: High-performance model execution
- Sandbox: Secure code execution environment using macOS Seatbelt (sandbox-exec)
- Cache: Intelligent caching system
- Scoring: Optimized scoring algorithms
To use the high-performance Rust execution engine instead of the Python runner, set `use_rust_engine: true` in your configuration or use the `--rust` flag (coming soon to the CLI). This delegates task scheduling and parallelization to the Rust core for maximum throughput and lower overhead.
Coding evaluations are executed in a secure sandbox. On macOS, this uses the Seatbelt (sandbox-exec) facility to restrict network access and file system writes, ensuring that LLM-generated code cannot compromise your host system.
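As a rough sketch of the kind of wrapping involved (Seatbelt via `sandbox-exec` on macOS, Bubblewrap's `bwrap` on Linux, per the features above), and not the framework's actual sandbox module:

```python
# Illustrative sketch only, not the framework's sandbox implementation:
# wrapping an untrusted command with sandbox-exec (macOS) or bwrap (Linux).
import platform
import subprocess

# Simplified, illustrative Seatbelt profile: no network, writes only under /tmp.
SEATBELT_PROFILE = """
(version 1)
(allow default)
(deny network*)
(deny file-write*)
(allow file-write* (subpath "/tmp"))
"""

def run_sandboxed(cmd: list[str]) -> subprocess.CompletedProcess:
    """Run cmd with network access blocked and writes restricted to /tmp."""
    if platform.system() == "Darwin":
        # macOS: Seatbelt via sandbox-exec with an inline profile
        wrapped = ["sandbox-exec", "-p", SEATBELT_PROFILE] + cmd
    else:
        # Linux: Bubblewrap namespace isolation, read-only root, private /tmp
        wrapped = [
            "bwrap", "--ro-bind", "/", "/", "--tmpfs", "/tmp",
            "--unshare-net", "--die-with-parent",
        ] + cmd
    return subprocess.run(wrapped, capture_output=True, text=True, timeout=30)

if __name__ == "__main__":
    # Hypothetical path to LLM-generated code under evaluation.
    result = run_sandboxed(["python3", "/tmp/generated_solution.py"])
    print(result.returncode, result.stdout[:200])
```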
```bash
# Install development dependencies
pip install -e ".[dev]"

# Install Rust toolchain
rustup update stable

# Build in development mode
maturin develop
```

```bash
# Python tests
python -m pytest tests/python/ -v

# Rust tests
cargo test --manifest-path rust/forge_evals_core/Cargo.toml

# Integration tests
python -m pytest tests/ -v
```
```python
# python/forge_evals/models/my_model.py
from .base import BaseModel

class MyModel(BaseModel):
    async def generate(self, messages, **kwargs):
        # Your implementation: call your provider's API with `messages`
        # and return the generated completion.
        pass
```
```python
# python/forge_evals/evals/my_eval.py
from .base import BaseEval

class MyEval(BaseEval):
    def prepare_messages(self, sample):
        # Your implementation: turn a dataset sample into the
        # message list sent to the model.
        pass
```
```python
# python/forge_evals/scoring/my_scorer.py
from .base import BaseScorer

class MyScorer(BaseScorer):
    def score(self, sample, result):
        # Your implementation: compare the model `result` against
        # the expected answer in `sample` and return a score.
        pass
```

```bash
forge eval benchmarks/math --models gpt-4o --parallel 4
```
name: "Custom Evaluation"
type: "freeform"
dataset:
file: "my_data.jsonl"
models:
- name: "gpt-4o"
provider: "openai"
model_id: "gpt-4o"
scoring:
method: "exact"forge run my_eval.yamlfrom forge_evals import EvalRunner, EvalConfig
config = EvalConfig(
name="My Evaluation",
dataset=DatasetConfig(path="data.jsonl"),
models=[ModelConfig(name="gpt-4o", provider="openai", model_id="gpt-4o")],
scoring=ScoringConfig(method="exact")
)
runner = EvalRunner(config)
results = await runner.run()- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Inspired by existing evaluation frameworks
- Built with Rust for performance and Python for usability
- Community contributions and feedback
- Documentation
- Issues
- Discussions
Made with ❤️ by the Forge Evals Team