SlopCodeBench evaluates coding agents under iterative specification refinement: the agent implements a spec, then extends its own code as the spec changes. This exposes behaviors that single-shot benchmarks cannot measure, including path dependence, non-convergence, and trade-offs between explicit handling and structural stability. We release SCBench as an open, community-driven evaluation primitive rather than a finalized benchmark.
We actively want more problems: follow the guide on creating a problem and open a PR!
> **Note:** This is an initial release. We're actively developing SlopCodeBench and welcome feedback via GitHub Issues.
Before installing, ensure you have:
- Python 3.12+ installed
- Docker installed and running (Get Docker)
- An API key for your chosen agent (e.g., Anthropic, OpenAI, Google)
- 8GB+ RAM recommended for running evaluations
- 10GB+ disk space for Docker images and workspaces
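You can sanity-check these prerequisites with a few standard commands (nothing here is specific to SlopCodeBench):

```bash
python3 --version   # expect 3.12 or newer
docker ps           # fails if the Docker daemon is not running
df -h .             # confirm you have roughly 10GB+ free
```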
```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repo and install dependencies
git clone https://github.com/SprocketLab/slop-code-bench.git && cd slop-code-bench && uv sync

# Set your API key
export ANTHROPIC_API_KEY="your-key"

# Run!
uv run slop-code run \
  --agent claude_code \
  --model anthropic/opus-4.5 \
  --environment configs/environments/docker-python3.12-uv.yaml \
  --prompt configs/prompts/just-solve.jinja \
  --problem file_backup \
  --problem execution_server \
  thinking=low \
  version=2.0.51
```

Parameter Reference:

- `thinking=none|low|medium|high`: controls the extended thinking budget; exact behavior depends on the agent.
- `version=X.Y.Z`: agent version to use.
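For example, a follow-up run that targets only the `execution_server` problem with a larger thinking budget would look like the following (the values are illustrative; the flags are the same as in the quick-start command above):

```bash
uv run slop-code run \
  --agent claude_code \
  --model anthropic/opus-4.5 \
  --environment configs/environments/docker-python3.12-uv.yaml \
  --prompt configs/prompts/just-solve.jinja \
  --problem execution_server \
  thinking=high \
  version=2.0.51
```

Each run writes to its own timestamped directory under `outputs/`, as shown below.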
Results are saved to:

```
outputs/opus-4.5/claude_code-just-solve_low_{timestamp}/
```
First run: Docker images are built automatically for the requested agent version (5-10 minutes). Subsequent runs are faster.
Docker not found:

```bash
# Check Docker is running
docker ps
# If not running, start Docker Desktop or the Docker daemon
```

API key not found:

```bash
# Verify your environment variable is set
echo $ANTHROPIC_API_KEY
# Or pass it directly
ANTHROPIC_API_KEY="your-key" uv run slop-code run ...
```

Out of disk space:

```bash
# Clean up old Docker images
docker system prune -a
```
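Before pruning, `docker system df` (a standard Docker command) shows how much space images, containers, and the build cache are using:

```bash
# Inspect Docker disk usage before cleaning up
docker system df
```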
For more issues, see GitHub Issues.

Evaluate a run:
```bash
slop-code eval outputs/your-run-directory/
```

Grade code quality with the LLM judge:

```bash
slop-code metrics judge \
  --rubric configs/rubrics/llm_judge.jsonl \
  --model <model on openrouter> \
  --criteria-template configs/rubrics/templates/criteria_with_pn.j2 \
  --prefix-template configs/rubrics/templates/no_expl.j2
```
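As a concrete illustration, a filled-in judge invocation might look like this; the model id `openai/gpt-4o` is only an example, substitute any model available on OpenRouter:

```bash
# Illustrative invocation: swap in your preferred OpenRouter model id
slop-code metrics judge \
  --rubric configs/rubrics/llm_judge.jsonl \
  --model openai/gpt-4o \
  --criteria-template configs/rubrics/templates/criteria_with_pn.j2 \
  --prefix-template configs/rubrics/templates/no_expl.j2
```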
We welcome contributions. Two ways to help:

- Add problems — Expand the benchmark with new evaluation scenarios. See the Problem Tutorial and Contributing Guide.
- Add agents — Integrate new coding agents. See the Agent Guide and Contributing Guide.
This is early-stage software. Your contributions will shape its direction.
| Guide | Description |
|---|---|
| ❓ FAQ | Frequently asked questions |
| 📖 Problem Tutorial | Create your first problem (30 min hands-on) |
| 📋 Quick Reference | One-page cheat sheet for problem authoring |
| 🤖 Agent Guide | Configure agents, models, and credentials |
| 🏗️ Architecture | How sessions, workspaces, and runtimes work |
| ✅ Evaluation System | Test cases, adapters, loaders, and verifiers |
| 💡 Problem Design | What makes a good evaluation problem |
| | Current limitations and workarounds |
| 📊 Commands | CLI command reference (run, eval, metrics, viz, etc.) |
If you found this useful, please cite us as:
```bibtex
@misc{slopcodebench,
  title        = {SlopCodeBench: Measuring Code Erosion Under Iterative Specification Refinement},
  author       = {Gabriel Orlanski and Devjeet Roy and Alexander Yun and
                  Changho Shin and Alex Gu and Albert Ge and
                  Dyah Adila and Aws Albarghouthi and Frederic Sala},
  year         = {2025},
  howpublished = {\url{https://github.com/SprocketLab/slop-code-bench}},
}
```