create, dry-run, run
The optimization path is three commands you can run together or separately:
- optimize create syncs the target metadata, imports any local dataset/verifier files, and creates the run. It prints the real run id and the exact run command to start the runner.
- optimize dry-run fetches the run config and checks the target module, dataset splits, verifiers, and DSPy install, without calling any models.
- optimize run evaluates the baseline, runs GEPA, evaluates the best candidate and holdout, uploads per-task results, and marks the run completed or failed.
    rollout optimize create \
      --module app.targets \
      --target support-agent \
      --dataset ./datasets/refunds \
      --verifier ./verifiers/quality.json \
      --reflection-model openai/gpt-4.1-mini \
      --auto light
    rollout optimize dry-run opt_123 --module app.targets
    rollout optimize run opt_123 --module app.targets

Tip
For a quick start, optimize run with no run id does everything in one shot: sync, upload, create, and run. Pass an existing run id such as opt_123 to re-run a run that already exists.
Budget & models
| Flag | Env | Effect |
|---|---|---|
| --reflection-model | ROLLOUT_GEPA_REFLECTION_MODEL | The strong model GEPA uses to propose new prompts. |
| --task-model | ROLLOUT_GEPA_STUDENT_MODEL | Optional DSPy task model (the student your agent runs). |
| --auto light\|medium\|heavy | — | Preset GEPA budget. Start with light. |
| --max-metric-calls N | — | Exact metric-call budget instead of --auto. |
| --min-holdout-improvement N | — | Require a minimum holdout lift before promotion. |
| --splits-file splits.json | — | Pin task names to train / val / holdout. |
Note
litellm routes the reflection model by its prefix. To go through OpenRouter, prefix the upstream model name (openrouter/openai/gpt-4.1-mini, not openai/gpt-4.1-mini) so that the call uses your OpenRouter key.
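To make the prefix rule concrete, the sketch below splits a model string on its first slash, which is roughly how prefix-based routing behaves. It is an illustration only, not litellm's actual routing code, and split_provider is a hypothetical helper.

```python
def split_provider(model: str) -> tuple[str, str]:
    """Split 'provider/rest' on the first slash only (illustrative helper)."""
    provider, _, upstream = model.partition("/")
    return provider, upstream


# With the OpenRouter prefix, the first segment selects OpenRouter and the
# remainder is kept as the upstream model name.
print(split_provider("openrouter/openai/gpt-4.1-mini"))

# Without it, the first segment is "openai", so the call would go to OpenAI
# directly and need an OpenAI key instead.
print(split_provider("openai/gpt-4.1-mini"))
```

The point to take away is that only the first path segment decides the provider, so the extra openrouter/ segment changes which key is used.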
Concurrency
Every stage is just LLM calls. --concurrency (default 4, or set ROLLOUT_GEPA_CONCURRENCY) parallelizes both the baseline/best scoring loops and GEPA's candidate evaluation, so a fast student finishes the baseline in seconds rather than minutes. Reflection itself stays sequential.
    rollout optimize run opt_123 --module app.targets --concurrency 8

Heads up
Do not crank concurrency blindly. If the provider rate-limits you, the 429 is caught and scored 0 — so over-threading does not error, it silently corrupts your scores. If results look oddly bad, lower --concurrency (try 1) before trusting the numbers.
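The effect is easy to see with a toy scoring loop. The sketch below is not the runner's code: score_task, safe_score, and the RateLimited exception are hypothetical stand-ins, every task is assumed to deserve 0.8, and the set of throttled tasks is fixed by hand rather than caused by real concurrency.

```python
from concurrent.futures import ThreadPoolExecutor


class RateLimited(Exception):
    """Stand-in for a provider's HTTP 429 error."""


def score_task(task_id: int, throttled: set[int]) -> float:
    # Hypothetical scorer: each task truly deserves 0.8, but throttled
    # tasks raise instead of returning a score.
    if task_id in throttled:
        raise RateLimited(f"429 on task {task_id}")
    return 0.8


def safe_score(task_id: int, throttled: set[int]) -> float:
    # Mirrors the behaviour described above: the error is caught and scored 0.
    try:
        return score_task(task_id, throttled)
    except RateLimited:
        return 0.0


def mean_score(n_tasks: int, throttled: set[int], concurrency: int) -> float:
    # Score all tasks in a thread pool and average the results.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        scores = list(pool.map(lambda t: safe_score(t, throttled), range(n_tasks)))
    return sum(scores) / len(scores)


clean = mean_score(10, throttled=set(), concurrency=1)
noisy = mean_score(10, throttled={2, 5, 7}, concurrency=8)
print(f"no rate limits: {clean:.2f}, three 429s: {noisy:.2f}")
# prints "no rate limits: 0.80, three 429s: 0.56"
```

Nothing fails in the second case, yet the mean drops from 0.80 to 0.56 purely because three calls were throttled. That is why a suspiciously low score is worth re-checking at a lower concurrency.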
The live run page
Open the run in the Rollout UI to watch it as it works. The runner streams progress back continuously, so the page updates without a refresh:
- Phase chip — evaluating baseline → searching → evaluating best → finalizing, so you always know which stage of the lifecycle you are in.
- Metric-call progress — calls spent against the budget, the clearest signal of how far through the search you are.
- Trials feed — a scrolling list of each candidate prompt × task as it is scored, with the score and the verifier's feedback, linking to the trace for that example.
- Candidate list — the candidate prompts ranked by mean and best score, growing as the search proposes and evaluates new ones, with a best-score trend.
- Heartbeat / staleness — a background heartbeat keeps the page live even through long reflection stretches; if the runner goes quiet the page tells you it has not heard from it rather than silently freezing.
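The staleness indicator in the last item can be pictured as a simple timeout on the most recent heartbeat. This is a generic sketch of that idea, not the Rollout UI's implementation; the 30-second threshold and the is_stale helper are assumptions made for illustration.

```python
import time
from typing import Optional

STALE_AFTER_SECONDS = 30.0  # hypothetical threshold, not a documented value


def is_stale(last_heartbeat: float, now: Optional[float] = None,
             stale_after: float = STALE_AFTER_SECONDS) -> bool:
    """Return True when no heartbeat has arrived within the allowed window."""
    if now is None:
        now = time.monotonic()
    return (now - last_heartbeat) > stale_after


print(is_stale(last_heartbeat=100.0, now=110.0))  # last heard 10 s ago
print(is_stale(last_heartbeat=100.0, now=145.0))  # silent for 45 s
```

A check like this lets the page distinguish a long reflection stretch, where heartbeats keep arriving, from a runner that has actually gone quiet.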
Tip
For a cheap end-to-end smoke test of the whole pipeline before committing a real budget, add --max-metric-calls 2.
Reading the result
When the run finishes you get the canonical candidates plus a promotion report comparing the best candidate to your baseline on the holdout split. If you set --min-holdout-improvement, the report tells you whether the lift cleared your bar. The winning candidate's prompt is stored on the run, ready to copy back into your agent.
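As a rough mental model of the promotion check, the sketch below compares holdout scores against a minimum lift. It assumes the threshold is an absolute difference in mean holdout score, which is not confirmed here, and clears_bar and the scores are made-up illustrations, not output from a real run.

```python
def clears_bar(baseline_holdout: float, best_holdout: float,
               min_improvement: float) -> bool:
    """Assumes the bar is an absolute difference in mean holdout score."""
    lift = best_holdout - baseline_holdout
    return lift >= min_improvement


# Illustrative numbers only.
print(clears_bar(0.62, 0.70, min_improvement=0.05))  # lift of about 0.08
print(clears_bar(0.62, 0.64, min_improvement=0.05))  # lift of about 0.02
```

Under this assumption the first candidate would be promoted and the second would not, even though both beat the baseline.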
Walk through all of this on a real example in the end-to-end guide.