optimize-anything

LLM-guided optimization for text artifacts using an iterative propose-evaluate-reflect loop with a bring-your-own evaluator.

Quickstart (v2)

# 1) Install
curl -fsSL https://raw.githubusercontent.com/ASRagab/optimize-anything/main/install.sh | bash

# 2) Create a seed artifact
echo "Write a concise support prompt" > seed.txt

# 3) Generate a starter evaluator (default: judge/python template)
optimize-anything generate-evaluator seed.txt \
  --objective "Score clarity, actionability, and specificity" \
  > eval.py

# 4) Optimize
optimize-anything optimize seed.txt \
  --judge-model openai/gpt-4o-mini \
  --objective "Improve clarity and specificity" \
  --model openai/gpt-4o-mini \
  --budget 20 \
  --parallel --workers 4 \
  --cache \
  --run-dir runs \
  --output result.txt

CLI stdout returns a JSON summary — see Result Contract for the full shape.

How It Works

optimize-anything runs a GEPA loop: propose → evaluate → reflect, repeating until the budget is exhausted or early stopping kicks in.

seed.txt ──► [Propose] ──► candidates
                 ▲               │
                 │           [Evaluate]
             [Reflect] ◄──── scores + diagnostics
  1. Propose — The optimizer generates candidate artifacts from your seed (or from scratch in seedless mode).
  2. Evaluate — Each candidate is scored by your evaluator. Three evaluator types are supported: a command evaluator (any executable that reads JSON on stdin and writes a score on stdout), an HTTP evaluator (a service that accepts POST requests), or a built-in LLM judge (no evaluator script required — just pass --judge-model).
  3. Reflect — Scores and diagnostics feed back into the next proposal round. The loop continues, progressively improving the artifact toward your objective.

The evaluator is the only thing you bring. Everything else — proposal strategy, reflection, early stopping, caching, parallelism — is handled by the optimizer.
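
For instance, a command evaluator can be a single short script. The sketch below (a hypothetical eval.py with a toy word-count heuristic, not the generated template) reads the JSON payload on stdin and writes a score on stdout; wire it in with --evaluator-command python eval.py.

#!/usr/bin/env python3
# eval.py: minimal command-evaluator sketch. The length target and
# phrase check are illustrative assumptions, not a recommended rubric.
import json
import sys

payload = json.load(sys.stdin)    # v2 payload; only "candidate" is required
candidate = payload["candidate"]

# Toy scoring: prefer roughly 80 words and an explicit step structure.
words = len(candidate.split())
length_score = max(0.0, 1.0 - abs(words - 80) / 200)
step_score = 1.0 if "step" in candidate.lower() else 0.5

score = round(0.5 * length_score + 0.5 * step_score, 3)
json.dump({"score": score, "notes": f"{words} words"}, sys.stdout)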

Runtime Modes

Dataset / Valset modes

Use --dataset for multi-task optimization (one evaluator call per example). Add --valset for generalization validation.

optimize-anything optimize prompt.txt \
  --judge-model openai/gpt-4o-mini \
  --objective "Generalize across customer request types" \
  --dataset data/train.jsonl \
  --valset data/val.jsonl \
  --model openai/gpt-4o-mini \
  --budget 120 --parallel --workers 6 --cache --run-dir runs
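
Each line of the JSONL is one example object; it is forwarded to your evaluator as the example field of the payload (see Evaluator Protocol below). The keys here are hypothetical, since the schema is whatever your evaluator expects:

# data/train.jsonl (hypothetical schema)
{"input": "Customer asks for a refund outside the 30-day window", "expected": "polite decline that offers alternatives"}
{"input": "Customer reports a duplicate charge", "expected": "apology plus escalation steps"}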

Multi-provider validation

Cross-check one artifact with multiple judge providers:

optimize-anything validate result.txt \
  --providers openai/gpt-4o-mini anthropic/claude-sonnet-4-5 google/gemini-2.0-flash \
  --objective "Score clarity, constraints, and robustness" \
  --intake-file intake.json

Seedless mode

No seed file is required; GEPA bootstraps from the objective.

optimize-anything optimize --no-seed \
  --objective "Draft a concise, testable API prompt" \
  --model openai/gpt-4o-mini \
  --judge-model openai/gpt-4o-mini

--no-seed requires both --objective and --model.

Early stopping and cache reuse

  • Early stop is auto-enabled when --budget > 30 (or force with --early-stop)
  • Reuse prior evaluator cache with --cache-from (requires --cache + --run-dir)
optimize-anything optimize seed.txt \
  --evaluator-command bash eval.sh \
  --model openai/gpt-4o-mini \
  --budget 150 \
  --cache --cache-from runs/run-20260303-120000 \
  --run-dir runs \
  --early-stop --early-stop-window 12 --early-stop-threshold 0.003

Score range options

For command/HTTP evaluators:

  • --score-range unit (default): enforce score in [0, 1]
  • --score-range any: allow any finite float
optimize-anything optimize seed.txt \
  --evaluator-command bash eval.sh \
  --model openai/gpt-4o-mini \
  --score-range any
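
With --score-range any, the evaluator may return any finite float, for example raw rubric points rather than a normalized score (values illustrative):

{"score": 42.5, "notes": "raw rubric points"}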

CLI Subcommands

  • optimize
  • generate-evaluator
  • intake
  • explain
  • budget
  • score
  • analyze
  • validate

Claude Code Plugin

optimize-anything is also a Claude Code plugin with guided slash commands and skills.

Plugin Regression Workflow

Use the regression harness when you want to verify that Claude can actually invoke the plugin correctly end-to-end, not merely that the CLI itself still works.

# Direct plugin regression run
uv run python scripts/plugin_regression.py

# Full repo validation including plugin regression
uv run python scripts/check.py --with-plugin

Requirements:

  • claude CLI installed and authenticated
  • OPENAI_API_KEY set in the shell that launches the command
  • ANTHROPIC_API_KEY set in the shell that launches the command

The harness runs three real scenarios (analyze, validate, quick), saves Claude JSON outputs plus stderr logs, and fails if Claude does not execute the expected workflow or the optimized artifact is not written.

Installation

# In Claude Code
/plugin install ASRagab/optimize-anything

Or clone and install locally:

git clone https://github.com/ASRagab/optimize-anything.git
cd optimize-anything
/plugin install .

Slash Commands

Command | Description
/optimize-anything:optimize | Guided optimization workflow that walks you through mode selection, evaluator setup, execution, and results
/optimize-anything:quick | Zero-config one-shot optimization; just provide a file and objective
/optimize-anything:analyze | Discover quality dimensions for an artifact and objective
/optimize-anything:score | Score a single artifact without optimization
/optimize-anything:validate | Cross-validate with multiple LLM judges
/optimize-anything:compare | Side-by-side comparison of two artifacts
/optimize-anything:budget | Get a budget recommendation for your artifact
/optimize-anything:explain | Preview the optimization plan without running it
/optimize-anything:intake | Normalize and validate an intake specification

Skills

The plugin includes three skills that Claude Code can invoke automatically:

  • optimization-guide — Full workflow walkthrough covering modes, configuration, budget, and result interpretation
  • generate-evaluator — Choose the right evaluator pattern (judge, command, composite) and generate a script
  • evaluator-patterns — Library of ready-to-use evaluator templates for prompts, code, docs, and agent instructions

Typical Plugin Workflow

/optimize-anything:analyze prompt.txt --objective "Score clarity"
  → discovers quality dimensions

/optimize-anything:quick prompt.txt "improve clarity and specificity"
  → runs analyze + optimize with sensible defaults, shows diff

/optimize-anything:validate result.txt --providers openai/gpt-4o anthropic/claude-sonnet-4-5
  → cross-checks the result with multiple judges

Reference

Result Contract

CLI stdout returns a JSON summary with these fields:

  • best_artifact
  • total_metric_calls
  • score_summary (initial, latest, best, deltas, num_candidates)
  • top_diagnostics (list of {name, value})
  • plateau_detected, plateau_guidance
  • optional evaluator_failure_signal
  • optional early_stopped, stopped_at_iteration
  • budget_utilization (requested, evaluator_calls, candidates_accepted, efficiency)
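
As a sketch, a run summary might look like the following (all values illustrative; the exact nesting of score_summary and deltas may differ, and the optional fields appear only when relevant):

{
  "best_artifact": "...optimized text...",
  "total_metric_calls": 20,
  "score_summary": {"initial": 0.41, "latest": 0.78, "best": 0.81, "deltas": {...}, "num_candidates": 9},
  "top_diagnostics": [{"name": "clarity", "value": 0.9}],
  "plateau_detected": false,
  "plateau_guidance": null,
  "budget_utilization": {"requested": 20, "evaluator_calls": 20, "candidates_accepted": 9, "efficiency": 0.45}
}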

Evaluator Protocol (v2)

Evaluator input payload (stdin JSON for command mode, POST JSON for HTTP mode):

{"_protocol_version": 2, "candidate": "...", "example": {...}, "task_model": "..."}
  • candidate is required
  • _protocol_version, example, and task_model are optional/additive
  • legacy evaluators that only read candidate remain compatible

Evaluator output payload:

{"score": 0.75, "notes": "optional diagnostics"}
  • score is required
  • additional keys are treated as side-info
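
An HTTP evaluator speaks the same payloads over POST. Below is a minimal stdlib-only sketch (hypothetical port and rubric, not an official reference server):

# server.py: minimal HTTP evaluator sketch using only the Python stdlib.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class Evaluator(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))
        candidate = payload["candidate"]           # the only required field
        # Toy rubric: fraction of required phrases present in the candidate.
        required = ("objective", "constraints")    # hypothetical phrase list
        score = sum(p in candidate.lower() for p in required) / len(required)
        body = json.dumps({"score": score, "notes": "phrase coverage"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("127.0.0.1", 8080), Evaluator).serve_forever()

Point the optimizer at wherever you serve it, e.g. --evaluator-url http://127.0.0.1:8080.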

Intake

Intake is optional structured guidance you pass to the optimizer to shape how evaluation works. Instead of relying solely on an --objective string, intake lets you declare quality dimensions, hard constraints, evaluation patterns, and execution preferences — giving you finer control over what "better" means for your artifact.

Use intake when your evaluation criteria are multi-dimensional, when you need to enforce hard constraints, or when you want consistent evaluation behavior across runs.

Pass it inline or from a file:

# Inline
optimize-anything optimize seed.txt \
  --intake-json '{"quality_dimensions": ["clarity", "specificity"], "hard_constraints": ["max 100 words"]}' \
  --judge-model openai/gpt-4o-mini

# From file
optimize-anything optimize seed.txt \
  --intake-file intake.json \
  --judge-model openai/gpt-4o-mini

optimize-anything intake normalizes and validates these keys:

  • artifact_class
  • quality_dimensions
  • hard_constraints
  • evaluation_pattern
  • execution_mode
  • evaluator_cwd
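
A minimal intake.json sketch covering those keys (values illustrative; "judge" follows the judge/command/composite patterns mentioned under Skills):

{
  "artifact_class": "prompt",
  "quality_dimensions": ["clarity", "specificity", "robustness"],
  "hard_constraints": ["max 100 words"],
  "evaluation_pattern": "judge",
  "execution_mode": "parallel",
  "evaluator_cwd": "."
}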

optimize flags (complete)

Exactly one evaluator source is required: --evaluator-command OR --evaluator-url OR --judge-model.

Flag | Description | Default
--no-seed | Run without seed file; bootstrap from objective | false
--evaluator-command <cmd...> | Command evaluator (stdin/stdout JSON) | --
--evaluator-url <url> | HTTP evaluator endpoint | --
--intake-json <json> | Inline intake spec | --
--intake-file <path> | Intake spec file | --
--evaluator-cwd <path> | Working dir for command evaluator | --
--objective <text> | Optimization objective | --
--background <text> | Extra domain context | --
--dataset <train.jsonl> | Training dataset JSONL | --
--valset <val.jsonl> | Validation dataset JSONL (requires --dataset) | --
--budget <int> | Max evaluator calls | 100
--output, -o <file> | Write best artifact to file | --
--model <model> | Proposer model (or env fallback) | OPTIMIZE_ANYTHING_MODEL
--judge-model <model> | Built-in LLM judge evaluator model | --
--judge-objective <text> | Judge objective override | falls back to --objective
--api-base <url> | Override LiteLLM API base | --
--diff | Print unified diff (seed vs best) to stderr | false
--run-dir <path> | Save run artifacts in timestamped run dir | --
--parallel | Enable parallel evaluator calls | false
--workers <int> | Max workers for parallel evaluation | --
--cache | Enable evaluator cache | false
--cache-from <run-dir> | Copy prior fitness_cache into new run | --
--early-stop | Enable plateau early stop | auto on when budget > 30
--early-stop-window <int> | Plateau window size | 10
--early-stop-threshold <float> | Min improvement required over window | 0.005
--spec-file <path> | Load TOML spec defaults | --
--task-model <model> | Optional metadata forwarded to evaluators | --
--score-range unit|any | Score validation mode for cmd/http | unit
