
SlopCodeBench (SCBench)


🌐 Website | 📝 Blog Post

SlopCodeBench evaluates coding agents under iterative specification refinement: the agent implements a spec, then extends its own code as the spec changes. This exposes behaviors that single-shot benchmarks cannot measure, including path dependence, non-convergence, and trade-offs between explicit handling and structural stability. We release SCBench as an open, community-driven evaluation primitive rather than a finalized benchmark.

We actively want more problems: follow the creating a problem guide and open a PR!

Note

This is an initial release. We're actively developing and welcome feedback via GitHub Issues.

Prerequisites

Before installing, ensure you have:

  • Python 3.12+ installed
  • Docker installed and running (Get Docker)
  • An API key for your chosen agent (e.g., Anthropic, OpenAI, Google)
  • 8GB+ RAM recommended for running evaluations
  • 10GB+ disk space for Docker images and workspaces
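A quick way to sanity-check these prerequisites from a shell:

python3 --version   # should report 3.12 or newer
docker info         # succeeds only when the Docker daemon is running
df -h .             # check available disk space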

🚀 Install

curl -LsSf https://astral.sh/uv/install.sh | sh
git clone https://github.com/SprocketLab/slop-code-bench.git && cd slop-code-bench && uv sync
export ANTHROPIC_API_KEY="your-key"

# Run!
uv run slop-code run \
  --agent claude_code \
  --model anthropic/opus-4.5 \
  --environment configs/environments/docker-python3.12-uv.yaml \
  --prompt configs/prompts/just-solve.jinja \
  --problem file_backup \
  --problem execution_server \
  thinking=low \
  version=2.0.51

Parameter Reference:

  • thinking=none|low|medium|high - Sets the extended-thinking budget; the exact mapping depends on the agent.
  • version=X.Y.Z - Agent version to use.
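For instance, to rerun a single problem with a larger thinking budget (same flags as the command above; problem and version values are illustrative):

uv run slop-code run \
  --agent claude_code \
  --model anthropic/opus-4.5 \
  --environment configs/environments/docker-python3.12-uv.yaml \
  --prompt configs/prompts/just-solve.jinja \
  --problem file_backup \
  thinking=high \
  version=2.0.51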

Results are saved to:

outputs/opus-4.5/claude_code-just-solve_low_{timestamp}/
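To find the most recent run directory (assuming the default outputs/ layout shown above):

ls -t outputs/opus-4.5/ | head -n 1   # newest run directory first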

First Run: a Docker image is built automatically for the requested agent version (5-10 minutes). Subsequent runs reuse the cached image and are faster.

Troubleshooting

Docker not found:

# Check Docker is running
docker ps
# If not running, start Docker Desktop or daemon
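On Linux with systemd, the daemon can typically be started with:

sudo systemctl start docker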

API key not found:

# Verify your environment variable is set
echo $ANTHROPIC_API_KEY
# Or pass it directly
ANTHROPIC_API_KEY="your-key" uv run slop-code run ...
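To persist the key across sessions (bash example; adjust for your shell):

echo 'export ANTHROPIC_API_KEY="your-key"' >> ~/.bashrc
source ~/.bashrc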

Out of disk space:

# Clean up old Docker images
docker system prune -a
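To see how much of that space Docker is actually using before pruning:

docker system df   # usage broken down by images, containers, and volumes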

For more issues, see GitHub Issues.

📊 Evaluation

Evaluate a run:

uv run slop-code eval outputs/your-run-directory/
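For example, to evaluate the run from the install step (assuming a single matching run; the glob expands the timestamped directory name):

uv run slop-code eval outputs/opus-4.5/claude_code-just-solve_low_*/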

Grade code quality with an LLM judge:

uv run slop-code metrics judge \
  --rubric configs/rubrics/llm_judge.jsonl \
  --model <model on openrouter> \
  --criteria-template configs/rubrics/templates/criteria_with_pn.j2 \
  --prefix-template configs/rubrics/templates/no_expl.j2

Contributing

We welcome contributions. Two ways to help:

  • Add new problems: follow the Problem Tutorial and open a PR.
  • File bug reports or share feedback via GitHub Issues.

This is early-stage software. Your contributions will shape its direction.

Documentation

  • ❓ FAQ: Frequently asked questions
  • 📖 Problem Tutorial: Create your first problem (30 min hands-on)
  • 📋 Quick Reference: One-page cheat sheet for problem authoring
  • 🤖 Agent Guide: Configure agents, models, and credentials
  • 🏗️ Architecture: How sessions, workspaces, and runtimes work
  • ✅ Evaluation System: Test cases, adapters, loaders, and verifiers
  • 💡 Problem Design: What makes a good evaluation problem
  • ⚠️ Known Issues: Current limitations and workarounds
  • 📊 Commands: CLI command reference (run, eval, metrics, viz, etc.)

Citing Us

If you found this useful, please cite us as:

@misc{slopcodebench,
  title        = {SlopCodeBench: Measuring Code Erosion Under Iterative Specification Refinement},
  author       = {Gabriel Orlanski and Devjeet Roy and Alexander Yun and 
                  Changho Shin and Alex Gu and Albert Ge and 
                  Dyah Adila and Aws Albarghouthi and Frederic Sala},
  year         = {2025},
  howpublished = {\url{https://github.com/SprocketLab/slop-code-bench}},
}
