LLM-guided optimization for text artifacts using an iterative propose-evaluate-reflect loop with a bring-your-own evaluator.
```bash
# 1) Install
curl -fsSL https://raw.githubusercontent.com/ASRagab/optimize-anything/main/install.sh | bash

# 2) Create a seed artifact
echo "Write a concise support prompt" > seed.txt

# 3) Generate a starter evaluator (default: judge/python template)
optimize-anything generate-evaluator seed.txt \
  --objective "Score clarity, actionability, and specificity" \
  > eval.py

# 4) Optimize
optimize-anything optimize seed.txt \
  --judge-model openai/gpt-4o-mini \
  --objective "Improve clarity and specificity" \
  --model openai/gpt-4o-mini \
  --budget 20 \
  --parallel --workers 4 \
  --cache \
  --run-dir runs \
  --output result.txt
```

CLI stdout returns a JSON summary — see Result Contract for the full shape.
optimize-anything runs a GEPA (Guided Evolutionary Prompt Algorithm) loop: propose → evaluate → reflect, repeating until budget is exhausted or early stopping kicks in.
```
seed.txt ──► [Propose] ──► candidates
                 ▲             │
                 │        [Evaluate]
            [Reflect] ◄── scores + diagnostics
```
- Propose — The optimizer generates candidate artifacts from your seed (or from scratch in seedless mode).
- Evaluate — Each candidate is scored by your evaluator. Three evaluator types are supported: a command evaluator (any executable that reads JSON on stdin and writes a score on stdout), an HTTP evaluator (a service that accepts POST requests), or a built-in LLM judge (no evaluator script required — just pass `--judge-model`).
- Reflect — Scores and diagnostics feed back into the next proposal round. The loop continues, progressively improving the artifact toward your objective.
The evaluator is the only thing you bring. Everything else — proposal strategy, reflection, early stopping, caching, parallelism — is handled by the optimizer.
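A command evaluator can be as small as a script that reads the payload on stdin and prints a score on stdout. Here is a minimal sketch in Python — the brevity heuristic is invented for illustration; the stdin/stdout JSON contract is the part that matters:

```python
#!/usr/bin/env python3
"""Minimal command evaluator sketch: JSON payload on stdin, JSON score on stdout."""
import json
import sys

def evaluate(payload: dict) -> dict:
    candidate = payload["candidate"]  # "candidate" is the only required input key
    # Illustrative heuristic only: reward prompts that stay under 200 words.
    words = len(candidate.split())
    score = max(0.0, min(1.0, 1.0 - words / 200.0))
    # "score" is the only required output key; extra keys become side-info.
    return {"score": score, "notes": f"{words} words"}

def main() -> None:
    # The optimizer invokes the script once per evaluation, e.g. via
    # --evaluator-command "python eval.py", piping the payload to stdin.
    json.dump(evaluate(json.load(sys.stdin)), sys.stdout)
```

When saved as a script, call `main()` at module end so the optimizer can run it via `--evaluator-command` (the file name and scoring logic above are arbitrary).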
Use `--dataset` for multi-task optimization (one evaluator call per example). Add `--valset` for generalization validation.
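A dataset is a JSONL file with one JSON example per line; each example reaches your evaluator as the `example` field of the input payload. The field names below are purely illustrative — the schema is whatever your evaluator expects:

```jsonl
{"request": "Refund a duplicate charge", "category": "billing"}
{"request": "Reset two-factor authentication", "category": "account"}
```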
```bash
optimize-anything optimize prompt.txt \
  --judge-model openai/gpt-4o-mini \
  --objective "Generalize across customer request types" \
  --dataset data/train.jsonl \
  --valset data/val.jsonl \
  --model openai/gpt-4o-mini \
  --budget 120 --parallel --workers 6 --cache --run-dir runs
```

Cross-check one artifact with multiple judge providers:
```bash
optimize-anything validate result.txt \
  --providers openai/gpt-4o-mini anthropic/claude-sonnet-4-5 google/gemini-2.0-flash \
  --objective "Score clarity, constraints, and robustness" \
  --intake-file intake.json
```

No seed file required; GEPA bootstraps from the objective.
```bash
optimize-anything optimize --no-seed \
  --objective "Draft a concise, testable API prompt" \
  --model openai/gpt-4o-mini \
  --judge-model openai/gpt-4o-mini
```

`--no-seed` requires both `--objective` and `--model`.
- Early stop is auto-enabled when `--budget > 30` (or force with `--early-stop`)
- Reuse a prior evaluator cache with `--cache-from` (requires `--cache` + `--run-dir`)
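The plateau rule can be pictured as a sliding-window check. This is a sketch of the presumed semantics of `--early-stop-window` and `--early-stop-threshold`, not the optimizer's actual code:

```python
def plateaued(scores: list[float], window: int = 10, threshold: float = 0.005) -> bool:
    """Sketch: stop when the best score in the most recent `window` evaluations
    improves on the best score seen before that window by less than `threshold`."""
    if len(scores) <= window:
        return False  # not enough history to judge a plateau
    best_before = max(scores[:-window])
    best_recent = max(scores[-window:])
    return best_recent - best_before < threshold
```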
```bash
optimize-anything optimize seed.txt \
  --evaluator-command bash eval.sh \
  --model openai/gpt-4o-mini \
  --budget 150 \
  --cache --cache-from runs/run-20260303-120000 \
  --run-dir runs \
  --early-stop --early-stop-window 12 --early-stop-threshold 0.003
```

For command/HTTP evaluators:
- `--score-range unit` (default): enforce score in `[0, 1]`
- `--score-range any`: allow any finite float
```bash
optimize-anything optimize seed.txt \
  --evaluator-command bash eval.sh \
  --model openai/gpt-4o-mini \
  --score-range any
```

Subcommands: `optimize`, `generate-evaluator`, `intake`, `explain`, `budget`, `score`, `analyze`, `validate`.
optimize-anything is also a Claude Code plugin with guided slash commands and skills.
Use the regression harness when you want to verify that Claude can actually invoke the plugin correctly end-to-end, not merely that the CLI itself still works.
```bash
# Direct plugin regression run
uv run python scripts/plugin_regression.py

# Full repo validation including plugin regression
uv run python scripts/check.py --with-plugin
```

Requirements:

- `claude` CLI installed and authenticated
- `OPENAI_API_KEY` set in the shell that launches the command
- `ANTHROPIC_API_KEY` set in the shell that launches the command
The harness runs three real scenarios (analyze, validate, quick), saves Claude JSON outputs plus stderr logs, and fails if Claude does not execute the expected workflow or the optimized artifact is not written.
```
# In Claude Code
/plugin install ASRagab/optimize-anything
```

Or clone and install locally:

```bash
git clone https://github.com/ASRagab/optimize-anything.git
cd optimize-anything
/plugin install .
```

| Command | Description |
|---|---|
| `/optimize-anything:optimize` | Guided optimization workflow — walks you through mode selection, evaluator setup, execution, and results |
| `/optimize-anything:quick` | Zero-config one-shot optimization. Just provide a file and objective. |
| `/optimize-anything:analyze` | Discover quality dimensions for an artifact and objective |
| `/optimize-anything:score` | Score a single artifact without optimization |
| `/optimize-anything:validate` | Cross-validate with multiple LLM judges |
| `/optimize-anything:compare` | Side-by-side comparison of two artifacts |
| `/optimize-anything:budget` | Get a budget recommendation for your artifact |
| `/optimize-anything:explain` | Preview the optimization plan without running it |
| `/optimize-anything:intake` | Normalize and validate an intake specification |
The plugin includes three skills that Claude Code can invoke automatically:
- `optimization-guide` — Full workflow walkthrough covering modes, configuration, budget, and result interpretation
- `generate-evaluator` — Choose the right evaluator pattern (judge, command, composite) and generate a script
- `evaluator-patterns` — Library of ready-to-use evaluator templates for prompts, code, docs, and agent instructions
```
/optimize-anything:analyze prompt.txt --objective "Score clarity"
→ discovers quality dimensions

/optimize-anything:quick prompt.txt "improve clarity and specificity"
→ runs analyze + optimize with sensible defaults, shows diff

/optimize-anything:validate result.txt --providers openai/gpt-4o anthropic/claude-sonnet-4-5
→ cross-checks the result with multiple judges
```
CLI stdout returns a JSON summary with these fields:

- `best_artifact`
- `total_metric_calls`
- `score_summary` (`initial`, `latest`, `best`, deltas, `num_candidates`)
- `top_diagnostics` (list of `{name, value}`)
- `plateau_detected`, `plateau_guidance`
- optional `evaluator_failure_signal`
- optional `early_stopped`, `stopped_at_iteration`
- `budget_utilization` (`requested`, `evaluator_calls`, `candidates_accepted`, `efficiency`)
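Pieced together from the field list above, a summary might look like the following. All values are invented, and the shape is abridged (the delta keys inside `score_summary` are omitted because their names are not specified here):

```json
{
  "best_artifact": "Write a concise support prompt that...",
  "total_metric_calls": 20,
  "score_summary": {"initial": 0.42, "latest": 0.78, "best": 0.81, "num_candidates": 9},
  "top_diagnostics": [{"name": "clarity", "value": 0.9}],
  "plateau_detected": false,
  "budget_utilization": {"requested": 20, "evaluator_calls": 20, "candidates_accepted": 9, "efficiency": 0.45}
}
```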
Evaluator input payload (stdin JSON for command mode, POST JSON for HTTP mode):

```json
{"_protocol_version": 2, "candidate": "...", "example": {...}, "task_model": "..."}
```

- `candidate` is required
- `_protocol_version`, `example`, and `task_model` are optional/additive
- legacy evaluators that only read `candidate` remain compatible
Evaluator output payload:

```json
{"score": 0.75, "notes": "optional diagnostics"}
```

- `score` is required
- additional keys are treated as side-info
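As a sketch of the HTTP side of this contract — the scoring heuristic and port are made up; only the payload shapes above come from the protocol — a stdlib-only evaluator service might look like:

```python
"""Sketch of an HTTP evaluator: accepts the POST payload, returns a score."""
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def evaluate(payload: dict) -> dict:
    """Toy heuristic (illustrative only): reward candidates under 100 words."""
    words = len(payload["candidate"].split())  # "candidate" is the only required key
    return {"score": max(0.0, min(1.0, 1.0 - words / 100.0)), "notes": f"{words} words"}

class EvaluatorHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        payload = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        body = json.dumps(evaluate(payload)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port: int = 8080) -> None:
    # Point --evaluator-url at http://127.0.0.1:<port>/ (port is arbitrary)
    HTTPServer(("127.0.0.1", port), EvaluatorHandler).serve_forever()
```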
Intake is optional structured guidance you pass to the optimizer to shape how evaluation works. Instead of relying solely on an --objective string, intake lets you declare quality dimensions, hard constraints, evaluation patterns, and execution preferences — giving you finer control over what "better" means for your artifact.
Use intake when your evaluation criteria are multi-dimensional, when you need to enforce hard constraints, or when you want consistent evaluation behavior across runs.
Pass it inline or from a file:
```bash
# Inline
optimize-anything optimize seed.txt \
  --intake-json '{"quality_dimensions": ["clarity", "specificity"], "hard_constraints": ["max 100 words"]}' \
  --judge-model openai/gpt-4o-mini

# From file
optimize-anything optimize seed.txt \
  --intake-file intake.json \
  --judge-model openai/gpt-4o-mini
```

`optimize-anything intake` normalizes and validates these keys: `artifact_class`, `quality_dimensions`, `hard_constraints`, `evaluation_pattern`, `execution_mode`, `evaluator_cwd`.
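An `intake.json` using those keys might look like this — the key names come from the list above, but every value is illustrative, and the accepted values for fields like `evaluation_pattern` and `execution_mode` are not specified here:

```json
{
  "artifact_class": "prompt",
  "quality_dimensions": ["clarity", "specificity"],
  "hard_constraints": ["max 100 words"],
  "evaluation_pattern": "judge",
  "execution_mode": "parallel",
  "evaluator_cwd": "."
}
```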
Exactly one evaluator source is required: `--evaluator-command`, `--evaluator-url`, or `--judge-model`.
| Flag | Description | Default |
|---|---|---|
| `--no-seed` | Run without seed file; bootstrap from objective | false |
| `--evaluator-command <cmd...>` | Command evaluator (stdin/stdout JSON) | -- |
| `--evaluator-url <url>` | HTTP evaluator endpoint | -- |
| `--intake-json <json>` | Inline intake spec | -- |
| `--intake-file <path>` | Intake spec file | -- |
| `--evaluator-cwd <path>` | Working dir for command evaluator | -- |
| `--objective <text>` | Optimization objective | -- |
| `--background <text>` | Extra domain context | -- |
| `--dataset <train.jsonl>` | Training dataset JSONL | -- |
| `--valset <val.jsonl>` | Validation dataset JSONL (requires `--dataset`) | -- |
| `--budget <int>` | Max evaluator calls | 100 |
| `--output, -o <file>` | Write best artifact to file | -- |
| `--model <model>` | Proposer model (or env fallback) | `OPTIMIZE_ANYTHING_MODEL` |
| `--judge-model <model>` | Built-in LLM judge evaluator model | -- |
| `--judge-objective <text>` | Judge objective override | falls back to `--objective` |
| `--api-base <url>` | Override LiteLLM API base | -- |
| `--diff` | Print unified diff (seed vs best) to stderr | false |
| `--run-dir <path>` | Save run artifacts in timestamped run dir | -- |
| `--parallel` | Enable parallel evaluator calls | false |
| `--workers <int>` | Max workers for parallel evaluation | -- |
| `--cache` | Enable evaluator cache | false |
| `--cache-from <run-dir>` | Copy prior `fitness_cache` into new run | -- |
| `--early-stop` | Enable plateau early stop | auto on when budget > 30 |
| `--early-stop-window <int>` | Plateau window size | 10 |
| `--early-stop-threshold <float>` | Min improvement required over window | 0.005 |
| `--spec-file <path>` | Load TOML spec defaults | -- |
| `--task-model <model>` | Optional metadata forwarded to evaluators | -- |
| `--score-range unit\|any` | Score validation mode for cmd/http | unit |