optimize-anything

LLM-guided optimization for text artifacts using an iterative propose-evaluate-reflect loop with a bring-your-own evaluator.

Quickstart (v2)

# 1) Install
curl -fsSL https://raw.githubusercontent.com/ASRagab/optimize-anything/main/install.sh | bash

# 2) Create a seed artifact
echo "Write a concise support prompt" > seed.txt

# 3) Generate a starter evaluator (default: judge/python template)
optimize-anything generate-evaluator seed.txt \
  --objective "Score clarity, actionability, and specificity" \
  > eval.py

# 4) Optimize
optimize-anything optimize seed.txt \
  --judge-model openai/gpt-4o-mini \
  --objective "Improve clarity and specificity" \
  --model openai/gpt-4o-mini \
  --budget 20 \
  --parallel --workers 4 \
  --cache \
  --run-dir runs \
  --output result.txt

CLI stdout returns a JSON summary — see Result Contract for the full shape.

How It Works

optimize-anything runs a GEPA loop: propose → evaluate → reflect, repeating until the budget is exhausted or early stopping kicks in.

seed.txt ──► [Propose] ──► candidates
                 ▲               │
                 │           [Evaluate]
             [Reflect] ◄──── scores + diagnostics
  1. Propose — The optimizer generates candidate artifacts from your seed (or from scratch in seedless mode).
  2. Evaluate — Each candidate is scored by your evaluator. Three evaluator types are supported: a command evaluator (any executable that reads JSON on stdin and writes a score on stdout), an HTTP evaluator (a service that accepts POST requests), or a built-in LLM judge (no evaluator script required — just pass --judge-model).
  3. Reflect — Scores and diagnostics feed back into the next proposal round. The loop continues, progressively improving the artifact toward your objective.

The evaluator is the only thing you bring. Everything else — proposal strategy, reflection, early stopping, caching, parallelism — is handled by the optimizer.
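
For instance, a command evaluator can be a single short script. The sketch below (a hypothetical eval.py with a toy word-count heuristic, not the generated template) reads the JSON payload on stdin and writes a score on stdout; wire it in with --evaluator-command python eval.py.

#!/usr/bin/env python3
# eval.py: minimal command-evaluator sketch. The length target and
# phrase check are illustrative assumptions, not a recommended rubric.
import json
import sys

payload = json.load(sys.stdin)    # v2 payload; only "candidate" is required
candidate = payload["candidate"]

# Toy scoring: prefer roughly 80 words and an explicit step structure.
words = len(candidate.split())
length_score = max(0.0, 1.0 - abs(words - 80) / 200)
step_score = 1.0 if "step" in candidate.lower() else 0.5

score = round(0.5 * length_score + 0.5 * step_score, 3)
json.dump({"score": score, "notes": f"{words} words"}, sys.stdout)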

Runtime Modes

Dataset / Valset modes

Use --dataset for multi-task optimization (one evaluator call per example). Add --valset for generalization validation.

optimize-anything optimize prompt.txt \
  --judge-model openai/gpt-4o-mini \
  --objective "Generalize across customer request types" \
  --dataset data/train.jsonl \
  --valset data/val.jsonl \
  --model openai/gpt-4o-mini \
  --budget 120 --parallel --workers 6 --cache --run-dir runs
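
Each line of the JSONL is one example object; it is forwarded to your evaluator as the example field of the payload (see Evaluator Protocol below). The keys here are hypothetical, since the schema is whatever your evaluator expects:

# data/train.jsonl (hypothetical schema)
{"input": "Customer asks for a refund outside the 30-day window", "expected": "polite decline that offers alternatives"}
{"input": "Customer reports a duplicate charge", "expected": "apology plus escalation steps"}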

Multi-provider validation

Cross-check one artifact with multiple judge providers:

optimize-anything validate result.txt \
  --providers openai/gpt-4o-mini anthropic/claude-sonnet-4-5 google/gemini-2.0-flash \
  --objective "Score clarity, constraints, and robustness" \
  --intake-file intake.json

Seedless mode

No seed file is required; GEPA bootstraps from the objective.

optimize-anything optimize --no-seed \
  --objective "Draft a concise, testable API prompt" \
  --model openai/gpt-4o-mini \
  --judge-model openai/gpt-4o-mini

--no-seed requires both --objective and --model.

Early stopping and cache reuse

  • Early stop is auto-enabled when --budget > 30 (or force with --early-stop)
  • Reuse prior evaluator cache with --cache-from (requires --cache + --run-dir)
optimize-anything optimize seed.txt \
  --evaluator-command bash eval.sh \
  --model openai/gpt-4o-mini \
  --budget 150 \
  --cache --cache-from runs/run-20260303-120000 \
  --run-dir runs \
  --early-stop --early-stop-window 12 --early-stop-threshold 0.003

Score range options

For command/HTTP evaluators:

  • --score-range unit (default): enforce score in [0, 1]
  • --score-range any: allow any finite float
optimize-anything optimize seed.txt \
  --evaluator-command bash eval.sh \
  --model openai/gpt-4o-mini \
  --score-range any
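
With --score-range any, the evaluator may return any finite float, for example raw rubric points rather than a normalized score (values illustrative):

{"score": 42.5, "notes": "raw rubric points"}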

CLI Subcommands

  • optimize
  • generate-evaluator
  • intake
  • explain
  • budget
  • score
  • analyze
  • validate

Claude Code Plugin

optimize-anything is also a Claude Code plugin with guided slash commands and skills.

Plugin Regression Workflow

Use the regression harness when you want to verify that Claude can actually invoke the plugin correctly end-to-end, not merely that the CLI itself still works.

# Direct plugin regression run
uv run python scripts/plugin_regression.py

# Full repo validation including plugin regression
uv run python scripts/check.py --with-plugin

Requirements:

  • claude CLI installed and authenticated
  • OPENAI_API_KEY set in the shell that launches the command
  • ANTHROPIC_API_KEY set in the shell that launches the command

The harness runs three real scenarios (analyze, validate, quick), saves Claude JSON outputs plus stderr logs, and fails if Claude does not execute the expected workflow or the optimized artifact is not written.

Installation

# In Claude Code
/plugin install ASRagab/optimize-anything

Or clone and install locally:

git clone https://github.com/ASRagab/optimize-anything.git
cd optimize-anything
/plugin install .

Slash Commands

Command | Description
/optimize-anything:optimize | Guided optimization workflow that walks you through mode selection, evaluator setup, execution, and results
/optimize-anything:quick | Zero-config one-shot optimization; just provide a file and objective
/optimize-anything:analyze | Discover quality dimensions for an artifact and objective
/optimize-anything:score | Score a single artifact without optimization
/optimize-anything:validate | Cross-validate with multiple LLM judges
/optimize-anything:compare | Side-by-side comparison of two artifacts
/optimize-anything:budget | Get a budget recommendation for your artifact
/optimize-anything:explain | Preview the optimization plan without running it
/optimize-anything:intake | Normalize and validate an intake specification

Skills

The plugin includes three skills that Claude Code can invoke automatically:

  • optimization-guide — Full workflow walkthrough covering modes, configuration, budget, and result interpretation
  • generate-evaluator — Choose the right evaluator pattern (judge, command, composite) and generate a script
  • evaluator-patterns — Library of ready-to-use evaluator templates for prompts, code, docs, and agent instructions

Typical Plugin Workflow

/optimize-anything:analyze prompt.txt --objective "Score clarity"
  → discovers quality dimensions

/optimize-anything:quick prompt.txt "improve clarity and specificity"
  → runs analyze + optimize with sensible defaults, shows diff

/optimize-anything:validate result.txt --providers openai/gpt-4o anthropic/claude-sonnet-4-5
  → cross-checks the result with multiple judges

Reference

Result Contract

CLI stdout returns a JSON summary with these fields:

  • best_artifact
  • total_metric_calls
  • score_summary (initial, latest, best, deltas, num_candidates)
  • top_diagnostics (list of {name, value})
  • plateau_detected, plateau_guidance
  • optional evaluator_failure_signal
  • optional early_stopped, stopped_at_iteration
  • budget_utilization (requested, evaluator_calls, candidates_accepted, efficiency)
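
As a sketch, a run summary might look like the following (all values illustrative; the exact nesting of score_summary and deltas may differ, and the optional fields appear only when relevant):

{
  "best_artifact": "...optimized text...",
  "total_metric_calls": 20,
  "score_summary": {"initial": 0.41, "latest": 0.78, "best": 0.81, "deltas": {...}, "num_candidates": 9},
  "top_diagnostics": [{"name": "clarity", "value": 0.9}],
  "plateau_detected": false,
  "plateau_guidance": null,
  "budget_utilization": {"requested": 20, "evaluator_calls": 20, "candidates_accepted": 9, "efficiency": 0.45}
}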

Evaluator Protocol (v2)

Evaluator input payload (stdin JSON for command mode, POST JSON for HTTP mode):

{"_protocol_version": 2, "candidate": "...", "example": {...}, "task_model": "..."}
  • candidate is required
  • _protocol_version, example, and task_model are optional/additive
  • legacy evaluators that only read candidate remain compatible

Evaluator output payload:

{"score": 0.75, "notes": "optional diagnostics"}
  • score is required
  • additional keys are treated as side-info
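
An HTTP evaluator speaks the same payloads over POST. Below is a minimal stdlib-only sketch (hypothetical port and rubric, not an official reference server):

# server.py: minimal HTTP evaluator sketch using only the Python stdlib.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class Evaluator(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))
        candidate = payload["candidate"]           # the only required field
        # Toy rubric: fraction of required phrases present in the candidate.
        required = ("objective", "constraints")    # hypothetical phrase list
        score = sum(p in candidate.lower() for p in required) / len(required)
        body = json.dumps({"score": score, "notes": "phrase coverage"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("127.0.0.1", 8080), Evaluator).serve_forever()

Point the optimizer at wherever you serve it, e.g. --evaluator-url http://127.0.0.1:8080.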

Intake

Intake is optional structured guidance you pass to the optimizer to shape how evaluation works. Instead of relying solely on an --objective string, intake lets you declare quality dimensions, hard constraints, evaluation patterns, and execution preferences — giving you finer control over what "better" means for your artifact.

Use intake when your evaluation criteria are multi-dimensional, when you need to enforce hard constraints, or when you want consistent evaluation behavior across runs.

Pass it inline or from a file:

# Inline
optimize-anything optimize seed.txt \
  --intake-json '{"quality_dimensions": ["clarity", "specificity"], "hard_constraints": ["max 100 words"]}' \
  --judge-model openai/gpt-4o-mini

# From file
optimize-anything optimize seed.txt \
  --intake-file intake.json \
  --judge-model openai/gpt-4o-mini

optimize-anything intake normalizes and validates these keys:

  • artifact_class
  • quality_dimensions
  • hard_constraints
  • evaluation_pattern
  • execution_mode
  • evaluator_cwd
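
A minimal intake.json sketch covering those keys (values illustrative; "judge" follows the judge/command/composite patterns mentioned under Skills):

{
  "artifact_class": "prompt",
  "quality_dimensions": ["clarity", "specificity", "robustness"],
  "hard_constraints": ["max 100 words"],
  "evaluation_pattern": "judge",
  "execution_mode": "parallel",
  "evaluator_cwd": "."
}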

optimize flags (complete)

Exactly one evaluator source is required: --evaluator-command OR --evaluator-url OR --judge-model.

Flag | Description | Default
--no-seed | Run without seed file; bootstrap from objective | false
--evaluator-command <cmd...> | Command evaluator (stdin/stdout JSON) | --
--evaluator-url <url> | HTTP evaluator endpoint | --
--intake-json <json> | Inline intake spec | --
--intake-file <path> | Intake spec file | --
--evaluator-cwd <path> | Working dir for command evaluator | --
--objective <text> | Optimization objective | --
--background <text> | Extra domain context | --
--dataset <train.jsonl> | Training dataset JSONL | --
--valset <val.jsonl> | Validation dataset JSONL (requires --dataset) | --
--budget <int> | Max evaluator calls | 100
--output, -o <file> | Write best artifact to file | --
--model <model> | Proposer model (or env fallback) | OPTIMIZE_ANYTHING_MODEL
--judge-model <model> | Built-in LLM judge evaluator model | --
--judge-objective <text> | Judge objective override | falls back to --objective
--api-base <url> | Override LiteLLM API base | --
--diff | Print unified diff (seed vs best) to stderr | false
--run-dir <path> | Save run artifacts in timestamped run dir | --
--parallel | Enable parallel evaluator calls | false
--workers <int> | Max workers for parallel evaluation | --
--cache | Enable evaluator cache | false
--cache-from <run-dir> | Copy prior fitness_cache into new run | --
--early-stop | Enable plateau early stop | auto on when budget > 30
--early-stop-window <int> | Plateau window size | 10
--early-stop-threshold <float> | Min improvement required over window | 0.005
--spec-file <path> | Load TOML spec defaults | --
--task-model <model> | Optional metadata forwarded to evaluators | --
--score-range unit|any | Score validation mode for cmd/http | unit
