Beta
Optimization is driven entirely from the Rollout CLI. You define a target in the Python or TypeScript SDK; the CLI handles dataset upload, verifier creation, run bookkeeping, and the GEPA search itself.
What GEPA does
GEPA (reflective prompt evolution) treats your prompt as the thing to optimize. It runs your agent on a batch of tasks, scores each output with your verifier, reads the per-task feedback, and asks a strong reflection model to propose a better prompt. It repeats that loop, keeping the candidates that score higher, until it exhausts the metric-call budget.
Two models are in play, and they do different jobs:
- The student model — the model your agent actually calls. GEPA is trying to make a weak or cheap student perform better by handing it a sharper prompt.
- The reflection model — a stronger model GEPA uses to read failures and write improved candidate prompts. It never serves your end users; it only proposes prompts.
The output is a set of scored candidate prompts plus a promotion report telling you whether the best candidate beat your baseline on held-out tasks.
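The loop described above can be sketched in a few lines of Python. This is a simplified illustration, not Rollout's or GEPA's actual implementation: it keeps a single best prompt greedily, whereas GEPA's real candidate selection is more involved. Every name here (`optimize_prompt`, `run_student`, `verify`, `reflect`, and the toy stand-ins) is hypothetical.

```python
from typing import Callable


def optimize_prompt(
    baseline: str,
    tasks: list[str],
    run_student: Callable[[str, str], str],           # (prompt, task) -> output
    verify: Callable[[str, str], tuple[float, str]],  # (task, output) -> (score, feedback)
    reflect: Callable[[str, list[str]], str],         # (prompt, feedback) -> new prompt
    budget: int,                                      # maximum number of verifier calls
) -> tuple[str, float, int]:
    """Greedy reflect-and-retry loop bounded by a metric-call budget."""
    if not tasks:
        raise ValueError("need at least one task")
    calls = 0

    def evaluate(prompt: str) -> tuple[float, list[str]]:
        nonlocal calls
        scores, feedback = [], []
        for task in tasks:
            output = run_student(prompt, task)
            score, note = verify(task, output)
            calls += 1  # one metric call per scored output
            scores.append(score)
            feedback.append(note)
        return sum(scores) / len(scores), feedback

    # Score the baseline first so there is a number to beat.
    best_prompt = baseline
    best_score, best_feedback = evaluate(baseline)

    # Propose, evaluate, and keep improvements until the budget runs out.
    while calls + len(tasks) <= budget:
        candidate = reflect(best_prompt, best_feedback)
        score, feedback = evaluate(candidate)
        if score > best_score:
            best_prompt, best_score, best_feedback = candidate, score, feedback

    return best_prompt, best_score, calls


# Toy stand-ins (assumptions for illustration only, no real model calls).
def toy_student(prompt: str, task: str) -> str:
    return f"{prompt} | {task}"


def toy_verify(task: str, output: str) -> tuple[float, str]:
    ok = "step by step" in output
    return (1.0, "ok") if ok else (0.0, "answer lacks visible reasoning")


def toy_reflect(prompt: str, feedback: list[str]) -> str:
    return prompt + " Think step by step."


if __name__ == "__main__":
    print(optimize_prompt("Answer the question.", ["q1", "q2"],
                          toy_student, toy_verify, toy_reflect, budget=6))
```

In a real run the student and reflection callables would wrap two different models, which is why the sketch takes them as separate parameters.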
Control plane vs compute plane
The single most important thing to understand is that the work is split in two, and Rollout never calls your models:
Rollout (control plane)          Your runner (compute plane)
─────────────────────────        ────────────────────────────
stores run config                imports your target module
stores dataset + verifier        fetches the run config
stores candidates + scores       runs YOUR agent on each task
stores traces                    calls YOUR model provider
stores the promotion report      scores outputs with the verifier
                                 uploads results back to Rollout
        ▲                                        │
        └────────── results upload ◀─────────────┘

Your provider keys, your agent code, and every model call stay inside your process. Rollout sees the run config, the candidate prompts, the scores, and the traces you emit, but never your keys.
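To make the split concrete, here is a minimal sketch of what one compute-plane step might look like. It is not the Rollout SDK: `TrialResult`, `run_trial`, and `to_upload_payload` are hypothetical names, and the payload shape is an assumption. The point it illustrates is that the provider key is read and used locally, while only the prompt, score, and trace are prepared for upload.

```python
import os
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass
class TrialResult:
    """What the runner would report back (hypothetical shape)."""
    task_id: str
    candidate_prompt: str
    score: float
    trace: list[str]


def run_trial(
    task_id: str,
    task_input: str,
    candidate_prompt: str,
    call_model: Callable[[str, str, str], str],  # (prompt, input, api_key) -> output
    verify: Callable[[str, str], float],         # (input, output) -> score
) -> TrialResult:
    # The provider key lives only in this process's environment.
    provider_key = os.environ.get("OPENAI_API_KEY", "")
    output = call_model(candidate_prompt, task_input, provider_key)
    score = verify(task_input, output)
    # The result carries the prompt, score, and trace, never the key.
    return TrialResult(task_id, candidate_prompt, score, trace=[f"output: {output}"])


def to_upload_payload(results: list[TrialResult]) -> list[dict]:
    """Serialize results for upload to the control plane."""
    return [asdict(r) for r in results]
```

Keeping the key out of the result type, rather than filtering it out later, means there is nothing secret to leak when results are serialized.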
The run lifecycle
A single rollout optimize run moves through these phases, which the live run page surfaces as they happen:
- Evaluate baseline — your baseline prompt is scored across the dataset so there is a number to beat.
- Search — GEPA proposes candidate prompts, evaluates them, reflects on the failures, and proposes again. This is the bulk of the run.
- Evaluate best — the top candidate is scored on the holdout split to check it generalizes rather than overfitting the search tasks.
- Finalize — Rollout records the canonical candidates and a promotion report comparing the best candidate to the baseline.
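The final comparison can be pictured as a small function over holdout scores. This is a sketch under assumptions: the `PromotionReport` fields, the `finalize` helper, and the `min_gain` threshold are hypothetical and do not describe Rollout's actual report format or promotion rule.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PromotionReport:
    """Hypothetical summary comparing the best candidate to the baseline."""
    baseline_mean: float
    best_mean: float
    promote: bool


def finalize(
    baseline_holdout_scores: list[float],
    best_holdout_scores: list[float],
    min_gain: float = 0.0,  # assumed knob: required improvement over baseline
) -> PromotionReport:
    baseline_mean = mean(baseline_holdout_scores)
    best_mean = mean(best_holdout_scores)
    # Promote only if the candidate beats the baseline on held-out tasks.
    return PromotionReport(baseline_mean, best_mean, best_mean - baseline_mean > min_gain)
```

Comparing on held-out tasks rather than on the search tasks is what guards against promoting a prompt that merely overfit the tasks it was tuned on.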
Tip
You can watch all four phases live in the Rollout UI — phase chip, metric-call progress, a scrolling trials feed, and a candidate list with mean/best scores. See Running & watching runs.
Where compute can run
Because the runner is just a Python process that imports your code, you can run it anywhere that can reach Rollout and your model provider:
- your local terminal, for development
- GitHub Actions or another CI runner
- Modal, Railway, Render, Fly, ECS, or a Kubernetes worker job
- a GPU host, if your target calls a local model
What the runner needs
The compute environment needs only:
- rollout-cli[optimize] installed (it pulls in DSPy)
- your target module and agent code on PYTHONPATH
- ROLLOUT_API_KEY for the workspace
- your provider keys, e.g. OPENAI_API_KEY or OPENROUTER_API_KEY
- optionally ROLLOUT_BASE_URL for a local or self-hosted Rollout
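Before starting a runner in CI or on a remote host, it can help to check the environment up front. The following is a small preflight sketch, not part of the Rollout CLI: `preflight` is a hypothetical helper, and the provider-key list simply mirrors the two examples above, so adjust it to whichever provider your target actually calls.

```python
import os
from typing import Mapping

REQUIRED = ("ROLLOUT_API_KEY",)
# Example provider keys from the list above; your provider may differ.
PROVIDER_KEYS = ("OPENAI_API_KEY", "OPENROUTER_API_KEY")


def preflight(env: Mapping[str, str]) -> list[str]:
    """Return a list of human-readable problems; empty means ready to run."""
    problems = []
    for name in REQUIRED:
        if not env.get(name):
            problems.append(f"{name} is not set")
    if not any(env.get(name) for name in PROVIDER_KEYS):
        problems.append("no provider key set (expected one of: " + ", ".join(PROVIDER_KEYS) + ")")
    return problems


if __name__ == "__main__":
    issues = preflight(os.environ)
    for issue in issues:
        print("preflight:", issue)
    raise SystemExit(1 if issues else 0)
```

ROLLOUT_BASE_URL is deliberately not checked because it is optional.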
Note
After rollout optimize create, the dataset and verifier already live in Rollout — the runner no longer needs the local files. It only needs your target module and the runtime secrets above.