SlopCodeBench evaluates coding agents under iterative specification refinement: the agent implements a spec, then extends its own code as the spec changes. This exposes behaviors that single-shot benchmarks cannot measure, including path dependence, non-convergence, and trade-offs between explicit handling and structural stability. We release SCBench as an open, community-driven evaluation primitive rather than a finalized benchmark.
We actively want more problems: follow the guide on creating a problem and open a PR!
> **Note:** This is an initial release. We're actively developing SlopCodeBench and welcome feedback via GitHub Issues.
Before installing, ensure you have:
- Python 3.12+ installed
- Docker installed and running (Get Docker)
- An API key for your chosen agent (e.g., Anthropic, OpenAI, Google)
- 8GB+ RAM recommended for running evaluations
- 10GB+ disk space for Docker images and workspaces
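You can sanity-check these prerequisites with a few standard commands (nothing here is specific to SlopCodeBench):

```bash
python3 --version   # expect 3.12 or newer
docker ps           # fails if the Docker daemon is not running
df -h .             # confirm you have roughly 10GB+ free
```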
```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repo and install dependencies
git clone https://github.com/SprocketLab/slop-code-bench.git && cd slop-code-bench && uv sync

# Set your API key
export ANTHROPIC_API_KEY="your-key"

# Run!
uv run slop-code run \
  --agent claude_code \
  --model anthropic/opus-4.5 \
  --environment configs/environments/docker-python3.12-uv.yaml \
  --prompt configs/prompts/just-solve.jinja \
  --problem file_backup \
  --problem execution_server \
  thinking=low \
  version=2.0.51
```

Parameter Reference:

- `thinking=none|low|medium|high`: controls the extended thinking budget; exact behavior depends on the agent.
- `version=X.Y.Z`: agent version to use.
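For example, a follow-up run that targets only the `execution_server` problem with a larger thinking budget would look like the following (the values are illustrative; the flags are the same as in the quick-start command above):

```bash
uv run slop-code run \
  --agent claude_code \
  --model anthropic/opus-4.5 \
  --environment configs/environments/docker-python3.12-uv.yaml \
  --prompt configs/prompts/just-solve.jinja \
  --problem execution_server \
  thinking=high \
  version=2.0.51
```

Each run writes to its own timestamped directory under `outputs/`, as shown below.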
Results are saved to:

```
outputs/opus-4.5/claude_code-just-solve_low_{timestamp}/
```
First run: Docker images are built automatically for the requested agent version (5-10 minutes). Subsequent runs are faster.
Docker not found:

```bash
# Check Docker is running
docker ps
# If not running, start Docker Desktop or the Docker daemon
```

API key not found:

```bash
# Verify your environment variable is set
echo $ANTHROPIC_API_KEY
# Or pass it directly
ANTHROPIC_API_KEY="your-key" uv run slop-code run ...
```

Out of disk space:

```bash
# Clean up old Docker images
docker system prune -a
```
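Before pruning, `docker system df` (a standard Docker command) shows how much space images, containers, and the build cache are using:

```bash
# Inspect Docker disk usage before cleaning up
docker system df
```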
For more issues, see GitHub Issues.

Evaluate a run:
```bash
slop-code eval outputs/your-run-directory/
```

Grade code quality with the LLM judge:

```bash
slop-code metrics judge \
  --rubric configs/rubrics/llm_judge.jsonl \
  --model <model on openrouter> \
  --criteria-template configs/rubrics/templates/criteria_with_pn.j2 \
  --prefix-template configs/rubrics/templates/no_expl.j2
```
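As a concrete illustration, a filled-in judge invocation might look like this; the model id `openai/gpt-4o` is only an example, substitute any model available on OpenRouter:

```bash
# Illustrative invocation: swap in your preferred OpenRouter model id
slop-code metrics judge \
  --rubric configs/rubrics/llm_judge.jsonl \
  --model openai/gpt-4o \
  --criteria-template configs/rubrics/templates/criteria_with_pn.j2 \
  --prefix-template configs/rubrics/templates/no_expl.j2
```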
We welcome contributions. Two ways to help:

- Add problems — Expand the benchmark with new evaluation scenarios. See the Problem Tutorial and Contributing Guide.
- Add agents — Integrate new coding agents. See the Agent Guide and Contributing Guide.
This is early-stage software. Your contributions will shape its direction.
| Guide | Description |
|---|---|
| ❓ FAQ | Frequently asked questions |
| 📖 Problem Tutorial | Create your first problem (30 min hands-on) |
| 📋 Quick Reference | One-page cheat sheet for problem authoring |
| 🤖 Agent Guide | Configure agents, models, and credentials |
| 🏗️ Architecture | How sessions, workspaces, and runtimes work |
| ✅ Evaluation System | Test cases, adapters, loaders, and verifiers |
| 💡 Problem Design | What makes a good evaluation problem |
| | Current limitations and workarounds |
| 📊 Commands | CLI command reference (run, eval, metrics, viz, etc.) |
If you found this useful, please cite us as:
```bibtex
@misc{slopcodebench,
  title        = {SlopCodeBench: Measuring Code Erosion Under Iterative Specification Refinement},
  author       = {Gabriel Orlanski and Devjeet Roy and Alexander Yun and
                  Changho Shin and Alex Gu and Albert Ge and
                  Dyah Adila and Aws Albarghouthi and Frederic Sala},
  year         = {2025},
  howpublished = {\url{https://github.com/SprocketLab/slop-code-bench}},
}
```