🔨 Forge Evals

A high-performance evaluation framework for Large Language Models, with a Rust-powered execution engine and a Python user interface.

✨ Features

  • 🚀 High Performance: Rust execution engine for parallel processing and caching
  • 🔧 Extensible: Plugin architecture for models, evaluators, scorers, and reporters
  • 📊 Rich Reporting: HTML and JSON reports with interactive visualizations
  • 🎯 Multiple Evaluation Types: Multiple choice, freeform, code generation, and more
  • 🤖 Model Agnostic: Support for OpenAI, Anthropic, and local models
  • 💾 Smart Caching: Intelligent caching to avoid redundant computations
  • 📈 Benchmark Suite: Built-in benchmarks for math, coding, and reasoning

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/your-org/forge-evals.git
cd forge-evals

# Build and install
./scripts/build.sh

Basic Usage

# Run a benchmark with default models
forge eval benchmarks/math

# Run with specific models
forge eval benchmarks/math --models gpt-4o claude-3-sonnet

# Run with custom configuration
forge run examples/reasoning_eval.yaml

# Generate HTML report
forge eval benchmarks/coding --report html --parallel 8

📁 Project Structure

anvil/
├── python/forge_evals/          # Python interface
│   ├── models/                  # Model adapters
│   ├── evals/                   # Evaluation tasks
│   ├── scoring/                 # Scoring methods
│   ├── reporting/               # Report generators
│   └── utils/                   # Utilities
├── rust/forge_evals_core/       # Rust execution engine
│   ├── engine.rs                # Core evaluation engine
│   ├── executor.rs              # Model execution
│   ├── sandbox.rs               # Code execution sandbox
│   ├── scoring.rs               # High-performance scoring
│   └── cache.rs                 # Caching system
├── benchmarks/                  # Evaluation benchmarks
│   ├── math/                    # Mathematical reasoning
│   ├── coding/                  # Code generation
│   └── reasoning/               # Logical reasoning
├── examples/                    # Example configurations
└── scripts/                     # Build and utility scripts

🎯 Evaluation Types

Multiple Choice

type: "multiple_choice"
config:
  options: ["A", "B", "C", "D"]
  require_reasoning: true

Freeform

type: "freeform"
config:
  max_length: 1000
  require_context: false

Code Generation

type: "code"
config:
  language: "python"
  require_tests: true
  max_lines: 50

🤖 Model Support

OpenAI

models:
  - name: "gpt-4o"
    provider: "openai"
    model_id: "gpt-4o"
    api_key: "${OPENAI_API_KEY}"

Anthropic

models:
  - name: "claude-3-sonnet"
    provider: "anthropic"
    model_id: "claude-3-sonnet-20240229"
    api_key: "${ANTHROPIC_API_KEY}"

Local Models (Ollama)

models:
  - name: "llama3-70b"
    provider: "local"
    model_id: "llama3:70b"
    api_config:
      command: "ollama"

📊 Scoring Methods

Exact Match

scoring:
  method: "exact"
  case_sensitive: false
  ignore_punctuation: true
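
In rough terms (a sketch of what these options mean, not the shipped scorer), case_sensitive: false and ignore_punctuation: true normalize both strings before comparing them:

# Illustrative only: mirrors the exact-match options above.
import string

def exact_match(prediction: str, target: str,
                case_sensitive: bool = False,
                ignore_punctuation: bool = True) -> bool:
    def normalize(text: str) -> str:
        text = text.strip()
        if not case_sensitive:
            text = text.lower()
        if ignore_punctuation:
            text = text.translate(str.maketrans("", "", string.punctuation))
        return text
    return normalize(prediction) == normalize(target)

# exact_match("The answer is 42.", "the answer is 42")  -> True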

LLM Judge

scoring:
  method: "llm_judge"
  judge_model: "gpt-4"
  threshold: 0.7
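
Conceptually, the judge model returns a score in [0, 1] and a sample passes when the score meets the threshold. A sketch with a hypothetical ask_judge callable standing in for the real judge-model call (not part of forge_evals):

def llm_judge_passes(question: str, answer: str, ask_judge, threshold: float = 0.7) -> bool:
    # ask_judge(prompt) -> float in [0, 1]; hypothetical helper.
    prompt = (
        "Rate from 0 to 1 how well the answer addresses the question.\n"
        f"Question: {question}\nAnswer: {answer}\nScore:"
    )
    return ask_judge(prompt) >= threshold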

Heuristics

scoring:
  method: "heuristics"
  keywords: ["def", "class", "import"]
  threshold: 0.6
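
Read the keyword heuristic above as "the fraction of required keywords found in the output must meet the threshold"; a minimal sketch of that reading (not necessarily the built-in scorer):

def heuristic_score(output: str, keywords: list[str], threshold: float = 0.6) -> bool:
    # Fraction of keywords present in the model output.
    hits = sum(1 for kw in keywords if kw in output)
    return hits / len(keywords) >= threshold

# heuristic_score("import os\n\ndef main(): ...", ["def", "class", "import"])
# -> 2/3 >= 0.6 -> True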

📈 Reports

HTML Reports

Interactive reports with:

  • 📊 Visual charts and graphs
  • 📋 Detailed result tables
  • 🔍 Sample-by-sample analysis
  • 📈 Performance metrics

JSON Reports

Structured data for:

  • 🤖 Programmatic analysis (see the snippet below)
  • 📊 Custom visualizations
  • 🔌 Integration with other tools
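
For example, assuming the JSON report exposes a top-level results array with per-sample score fields (the field names here are hypothetical; check an actual report for the exact schema):

import json

with open("report.json") as fh:
    report = json.load(fh)

scores = [r["score"] for r in report.get("results", [])]
if scores:
    print(f"{len(scores)} samples, mean score {sum(scores) / len(scores):.3f}")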

📚 Documentation

🎯 Key Features

Advanced Metrics & Statistics

  • Automated Confidence Intervals: 95% CI for accuracy and mean scores using NumPy
  • Latency Tracking: Per-request execution time with P95 percentiles
  • Token Throughput: Automatic calculation of tokens/second
  • Statistical Rigor: Wald intervals for categorical data, standard error for continuous
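
The Wald interval and P95 latency described above come down to a few lines of NumPy; a minimal sketch (not the framework's stats engine itself):

import numpy as np

def wald_ci_95(correct: int, total: int) -> tuple[float, float]:
    # 95% Wald interval for an accuracy estimate.
    p = correct / total
    half = 1.96 * np.sqrt(p * (1 - p) / total)
    return p - half, p + half

def p95_latency(latencies_ms) -> float:
    # 95th percentile of per-request latencies.
    return float(np.percentile(latencies_ms, 95))

# wald_ci_95(87, 100) -> approximately (0.804, 0.936)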

Secure Code Execution

  • macOS Sandboxing: Native Seatbelt (sandbox-exec) integration
  • Linux Sandboxing: Bubblewrap (bwrap) namespace isolation
  • Network Isolation: Complete network blocking for untrusted code
  • Filesystem Protection: Read-only system access, writable /tmp only
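
On Linux, the isolation described above corresponds to a Bubblewrap invocation along these lines (illustrative only; the real sandbox lives in rust/forge_evals_core/sandbox.rs and its exact profile may differ):

import subprocess

def run_sandboxed(script_path: str, timeout: int = 30) -> subprocess.CompletedProcess:
    # Read-only rootfs, writable tmpfs /tmp, no network, dies with the parent.
    cmd = [
        "bwrap",
        "--ro-bind", "/", "/",
        "--dev", "/dev",
        "--proc", "/proc",
        "--tmpfs", "/tmp",
        "--unshare-net",
        "--die-with-parent",
        "python3", script_path,
    ]
    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)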

High-Performance Rust Engine

  • Parallel Execution: Task scheduling delegated to the Rust core
  • FFI Integration: Seamless Python-Rust interop via PyO3
  • Async Scoring: Non-blocking evaluation pipeline
  • Intelligent Caching: Redis-compatible result caching

πŸ—οΈ Architecture

Python Layer (User Interface)

  • CLI: Command-line interface for easy usage
  • Models: Adapters for different model providers
  • Evaluators: Task-specific evaluation logic
  • Scorers: Flexible scoring algorithms (Async supported)
  • Reporters: Multiple output formats
  • Stats Engine: High-performance statistics with 95% Confidence Intervals using NumPy

Rust Layer (Performance Engine)

  • Engine: Core evaluation orchestration, callable from Python via FFI
  • Executor: High-performance model execution
  • Sandbox: Secure code execution using macOS Seatbelt (sandbox-exec) or Linux Bubblewrap (bwrap)
  • Cache: Intelligent caching system
  • Scoring: Optimized scoring algorithms

⚡ Rust Performance Engine

To use the high-performance Rust execution engine instead of the Python runner, set use_rust_engine: true in your configuration or use the --rust flag (coming soon to CLI). This delegates task scheduling and parallelization to the Rust core for maximum throughput and lower overhead.
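
If you configure runs programmatically (see Programmatic Usage below), the same switch would presumably be passed through EvalConfig; a sketch under the assumption that the Python config mirrors the YAML keys:

from forge_evals import EvalConfig, DatasetConfig, ModelConfig, ScoringConfig

config = EvalConfig(
    name="Rust-accelerated run",
    dataset=DatasetConfig(path="data.jsonl"),
    models=[ModelConfig(name="gpt-4o", provider="openai", model_id="gpt-4o")],
    scoring=ScoringConfig(method="exact"),
    use_rust_engine=True,  # assumption: mirrors the YAML `use_rust_engine` key
)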

🛡️ Sandboxing

Coding evaluations are executed in a secure sandbox. On macOS, this uses the Seatbelt (sandbox-exec) facility to restrict network access and file system writes; on Linux, Bubblewrap (bwrap) provides equivalent namespace-based isolation. This ensures that LLM-generated code cannot compromise your host system.

🛠️ Development

Setup

# Install development dependencies
pip install -e ".[dev]"

# Install Rust toolchain
rustup update stable

# Build in development mode
maturin develop

Running Tests

# Python tests
python -m pytest tests/python/ -v

# Rust tests
cargo test --manifest-path rust/forge_evals_core/Cargo.toml

# Integration tests
python -m pytest tests/ -v

Adding New Components

Model Adapter

# python/forge_evals/models/my_model.py
from .base import BaseModel

class MyModel(BaseModel):
    async def generate(self, messages, **kwargs):
        # Your implementation
        pass

Evaluation Type

# python/forge_evals/evals/my_eval.py
from .base import BaseEval

class MyEval(BaseEval):
    def prepare_messages(self, sample):
        # Your implementation
        pass

Scoring Method

# python/forge_evals/scoring/my_scorer.py
from .base import BaseScorer

class MyScorer(BaseScorer):
    def score(self, sample, result):
        # Your implementation
        pass

📚 Examples

Basic Evaluation

forge eval benchmarks/math --models gpt-4o --parallel 4

Custom Configuration

# my_eval.yaml
name: "Custom Evaluation"
type: "freeform"
dataset:
  file: "my_data.jsonl"
models:
  - name: "gpt-4o"
    provider: "openai"
    model_id: "gpt-4o"
scoring:
  method: "exact"

forge run my_eval.yaml

Programmatic Usage

import asyncio

from forge_evals import (EvalRunner, EvalConfig, DatasetConfig,
                         ModelConfig, ScoringConfig)

config = EvalConfig(
    name="My Evaluation",
    dataset=DatasetConfig(path="data.jsonl"),
    models=[ModelConfig(name="gpt-4o", provider="openai", model_id="gpt-4o")],
    scoring=ScoringConfig(method="exact"),
)

runner = EvalRunner(config)
results = asyncio.run(runner.run())  # runner.run() is async

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Inspired by existing evaluation frameworks
  • Built with Rust for performance and Python for usability
  • Community contributions and feedback

📞 Support


Made with ❤️ by the Forge Evals Team
