Running & watching runs

create, dry-run, run

The optimization path is three commands you can run together or separately:

optimize create syncs the target metadata, imports any local dataset/verifier files, and creates the run. It prints the real run id and the exact run command to start the runner.
optimize dry-run fetches the run config and checks the target module, dataset splits, verifiers, and DSPy install — without calling any models.
optimize run evaluates the baseline, runs GEPA, evaluates the best candidate and holdout, uploads per-task results, and marks the run completed or failed.

shell

rollout optimize create \
  --module app.targets \
  --target support-agent \
  --dataset ./datasets/refunds \
  --verifier ./verifiers/quality.json \
  --reflection-model openai/gpt-4.1-mini \
  --auto light

rollout optimize dry-run opt_123 --module app.targets

rollout optimize run opt_123 --module app.targets

Tip

For a quick start, optimize run with no run id does everything in one shot — sync, upload, create, and run. Pass an existing opt_123 id to re-run a run that already exists.
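Because dry-run checks the setup without calling any models, it can serve as a cheap gate in front of the real run. The wrapper below is a minimal sketch of that idea. The function name `optimize_checked` is hypothetical, and it assumes that `rollout optimize dry-run` exits with a non-zero status when a check fails, which this page does not confirm.

```shell
# Sketch: run the dry-run checks first and start the real run only if they pass.
# ASSUMPTION: `rollout optimize dry-run` exits non-zero on a failed check.
# `optimize_checked` is a hypothetical helper, not part of the rollout CLI.
optimize_checked() {
  run_id="$1"   # run id printed by `optimize create`, e.g. opt_123
  module="$2"   # target module, e.g. app.targets

  if rollout optimize dry-run "$run_id" --module "$module"; then
    rollout optimize run "$run_id" --module "$module"
  else
    echo "dry-run failed for $run_id; not starting the run" >&2
    return 1
  fi
}
```

With the function defined, `optimize_checked opt_123 app.targets` would run both steps in order and stop before any model calls if the dry-run reports a problem.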

Budget & models

Flag	Env	Effect
--reflection-model	ROLLOUT_GEPA_REFLECTION_MODEL	The strong model GEPA uses to propose new prompts.
--task-model	ROLLOUT_GEPA_STUDENT_MODEL	Optional DSPy task model (the student your agent runs).
--auto light|medium|heavy	—	Preset GEPA budget. Start with light.
--max-metric-calls N	—	Exact metric-call budget instead of --auto.
--min-holdout-improvement N	—	Require a minimum holdout lift before promotion.
--splits-file splits.json	—	Pin task names to train / val / holdout.

Note

litellm routes the reflection model by its prefix. Through OpenRouter, prefix the upstream model — openrouter/openai/gpt-4.1-mini, not openai/gpt-4.1-mini — so it uses your OpenRouter key.
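One way to avoid forgetting the prefix is to normalize the model id before exporting it. The helper below is an illustrative sketch: `openrouter_model` is a hypothetical function name, and the only behavior it relies on is the `openrouter/` prefix rule described in the note and the `ROLLOUT_GEPA_REFLECTION_MODEL` variable from the table above.

```shell
# Sketch: add the openrouter/ prefix to an upstream model id unless it is already there.
# `openrouter_model` is a hypothetical helper, not part of the rollout CLI.
openrouter_model() {
  case "$1" in
    openrouter/*) printf '%s\n' "$1" ;;            # already routed through OpenRouter
    *)            printf 'openrouter/%s\n' "$1" ;; # prefix the upstream model id
  esac
}

# Set the reflection model through the environment instead of --reflection-model.
ROLLOUT_GEPA_REFLECTION_MODEL="$(openrouter_model openai/gpt-4.1-mini)"
export ROLLOUT_GEPA_REFLECTION_MODEL
echo "$ROLLOUT_GEPA_REFLECTION_MODEL"   # prints openrouter/openai/gpt-4.1-mini
```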

Concurrency

Every stage is just LLM calls. --concurrency (default 4, or set ROLLOUT_GEPA_CONCURRENCY) parallelizes both the baseline/best scoring loops and GEPA's candidate evaluation, so a fast student finishes the baseline in seconds rather than minutes. Reflection itself stays sequential.

shell

rollout optimize run opt_123 --module app.targets --concurrency 8

Heads up

Do not crank concurrency blindly. If the provider rate-limits you, the 429 is caught and scored 0 — so over-threading does not error, it silently corrupts your scores. If results look oddly bad, lower --concurrency (try 1) before trusting the numbers.
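To see why this matters, consider how a mean score reacts when some calls come back as 0. The numbers below are made up for illustration, and `mean_score` is a hypothetical helper. The sketch only shows the arithmetic, not how the runner computes its scores.

```shell
# Sketch: effect of rate-limited calls (scored 0) on a mean score.
# `mean_score` is a hypothetical helper; the scores are invented examples.
mean_score() {
  printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.2f\n", sum / NR }'
}

mean_score 0.8 0.8 0.8 0.8   # every call succeeded: prints 0.80
mean_score 0.8 0 0.8 0       # two calls hit a 429 and were scored 0: prints 0.40
```

The prompt is equally good in both cases, but the second mean is halved. This is why a baseline that looks oddly bad is worth re-checking at a lower --concurrency.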

The live run page

Open the run in the Rollout UI to watch it as it works. The runner streams progress back continuously, so the page updates without a refresh:

Phase chip — evaluating baseline → searching → evaluating best → finalizing, so you always know which stage of the lifecycle you are in.
Metric-call progress — calls spent against the budget, the clearest signal of how far through the search you are.
Trials feed — a scrolling list of each candidate prompt × task as it is scored, with the score and the verifier's feedback, linking to the trace for that example.
Candidate list — the candidate prompts ranked by mean and best score, growing as the search proposes and evaluates new ones, with a best-score trend.
Heartbeat / staleness — a background heartbeat keeps the page live even through long reflection stretches; if the runner goes quiet the page tells you it has not heard from it rather than silently freezing.

Tip

For a cheap end-to-end smoke test of the whole pipeline before committing a real budget, add --max-metric-calls 2.

Reading the result

When the run finishes you get the canonical candidates plus a promotion report comparing the best candidate to your baseline on the holdout split. If you set --min-holdout-improvement, the report tells you whether the lift cleared your bar. The winning candidate's prompt is stored on the run, ready to copy back into your agent.
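The check behind --min-holdout-improvement can be pictured as a simple comparison. The sketch below assumes the lift is the absolute difference between the best candidate's holdout score and the baseline's. This page does not say whether the runner uses an absolute or a relative difference, so treat both the helper name `clears_bar` and the formula as assumptions.

```shell
# Sketch: does the holdout lift clear the minimum improvement?
# ASSUMPTION: lift = best - baseline (absolute difference); the runner may differ.
# `clears_bar` is a hypothetical helper. Usage: clears_bar BASELINE BEST MIN_IMPROVEMENT
clears_bar() {
  awk -v base="$1" -v best="$2" -v min="$3" \
    'BEGIN { exit !((best - base) >= min) }'
}

if clears_bar 0.62 0.70 0.05; then
  echo "promote"        # lift of 0.08 meets the 0.05 bar
else
  echo "keep baseline"
fi
```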

Walk through all of this on a real example in the end-to-end guide.