Rollout documentation

Beta

Optimization is driven entirely from the Rollout CLI. You define a target in the Python or TypeScript SDK; the CLI does the dataset upload, verifier creation, run bookkeeping, and the GEPA search itself.

What GEPA does

GEPA (reflective prompt evolution) treats your prompt as the thing to optimize. It runs your agent on a batch of tasks, scores each output with your verifier, reads the per-task feedback, and asks a strong reflection model to propose a better prompt. It repeats that loop, keeping the candidates that score higher, until it exhausts the metric-call budget.
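The loop above can be sketched in a few lines of plain Python. This is a simplified, greedy illustration, not Rollout's or GEPA's actual implementation: the function `optimize_prompt`, its parameters, and the toy agent, verifier, and reflection functions are all hypothetical names chosen for the example, and a real search keeps more than one candidate alive.

```python
import random


def optimize_prompt(baseline, tasks, run_agent, verify, reflect,
                    max_metric_calls, batch_size=2, seed=0):
    """Greedy sketch of a reflect-and-propose loop (all names hypothetical)."""
    rng = random.Random(seed)
    metric_calls = 0

    def evaluate(prompt, batch):
        # One metric call = one task run by the student and scored by the verifier.
        nonlocal metric_calls
        scores, feedback = [], []
        for task in batch:
            score, note = verify(task, run_agent(prompt, task))
            metric_calls += 1
            scores.append(score)
            feedback.append(note)
        return sum(scores) / len(scores), feedback

    best = baseline
    # Each round scores the current best and one candidate on the same batch,
    # so it costs at most 2 * batch_size metric calls.
    while metric_calls + 2 * batch_size <= max_metric_calls:
        batch = rng.sample(tasks, min(batch_size, len(tasks)))
        best_score, feedback = evaluate(best, batch)
        candidate = reflect(best, feedback)      # reflection model proposes
        candidate_score, _ = evaluate(candidate, batch)
        if candidate_score > best_score:         # keep only improvements
            best = candidate
    return best, metric_calls


# Toy stand-ins so the sketch runs without any model provider.
def toy_agent(prompt, task):
    return f"{prompt} | {task}"


def toy_verify(task, output):
    if "step by step" in output:
        return 1.0, "ok"
    return 0.0, "missing step-by-step reasoning"


def toy_reflect(prompt, feedback):
    if any("missing" in note for note in feedback):
        return prompt + " Think step by step."
    return prompt


best_prompt, calls_used = optimize_prompt(
    "Answer.", ["a", "b", "c"], toy_agent, toy_verify, toy_reflect,
    max_metric_calls=8,
)
```

The budget check runs before each round, so the loop never spends more metric calls than `max_metric_calls`.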

Two models are in play, and they do different jobs:

The student model — the model your agent actually calls. GEPA is trying to make a weak or cheap student perform better by handing it a sharper prompt.
The reflection model — a stronger model GEPA uses to read failures and write improved candidate prompts. It never serves your end users; it only proposes prompts.

The output is a set of scored candidate prompts plus a promotion report telling you whether the best candidate beat your baseline on held-out tasks.

Control plane vs compute plane

The single most important thing to understand is that the work is split in two, and Rollout never calls your models:

architecture.txt

Rollout (control plane)                Your runner (compute plane)
─────────────────────────              ────────────────────────────
stores run config                      imports your target module
stores dataset + verifier              fetches the run config
stores candidates + scores             runs YOUR agent on each task
stores traces                          calls YOUR model provider
stores the promotion report            scores outputs with the verifier
                                       uploads results back to Rollout
         ▲                                          │
         └──────────  results upload  ◀─────────────┘

Your provider keys, your agent code, and every model call stay inside your process. Rollout sees the run config, the candidate prompts, the scores, and the traces you emit — never your keys.
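To make the boundary concrete, here is a minimal sketch that models the two planes in memory. `FakeControlPlane`, `run_tasks`, and the toy agent and verifier are hypothetical stand-ins, not the Rollout SDK or CLI. The point is only that the provider key is used inside the runner and never appears in anything uploaded.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeControlPlane:
    """Stand-in for Rollout: stores config and results, never calls a model."""
    run_config: Dict[str, object]
    results: List[Dict[str, object]] = field(default_factory=list)

    def fetch_run_config(self) -> Dict[str, object]:
        return dict(self.run_config)

    def upload_result(self, result: Dict[str, object]) -> None:
        self.results.append(result)


def run_tasks(control, agent, verifier, provider_key):
    """Compute plane: run the agent locally and upload only scores and traces."""
    config = control.fetch_run_config()
    prompt = config["prompt"]
    for task in config["tasks"]:
        # The key is handed to the agent here and goes no further.
        output = agent(prompt, task, provider_key)
        control.upload_result({
            "task": task,
            "prompt": prompt,
            "score": verifier(task, output),
            "trace": output,
        })


def toy_agent(prompt, task, provider_key):
    # A real agent would pass provider_key to its model client.
    assert provider_key
    return task.upper()


def toy_verifier(task, output):
    return 1.0 if output == task.upper() else 0.0


control = FakeControlPlane({"prompt": "Shout the input.", "tasks": ["hi", "ok"]})
run_tasks(control, toy_agent, toy_verifier, provider_key="sk-local-only")
```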

The run lifecycle

A single rollout optimize run moves through these phases, which the live run page surfaces as they happen:

Evaluate baseline — your baseline prompt is scored across the dataset so there is a number to beat.
Search — GEPA proposes candidate prompts, evaluates them, reflects on the failures, and proposes again. This is the bulk of the run.
Evaluate best — the top candidate is scored on the holdout split to check that it generalizes rather than overfitting to the search tasks.
Finalize — Rollout records the canonical candidates and a promotion report comparing the best candidate to the baseline.
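As a rough illustration of the last two phases, the sketch below compares per-task holdout scores for the baseline and the best candidate. The `Phase` enum, the `promotion_report` function, and the report fields are assumptions made for this example. Rollout's actual report format and promotion rule may differ.

```python
from enum import Enum
from statistics import mean
from typing import Dict, Sequence


class Phase(Enum):
    # Phase names follow the list above; the string values are illustrative.
    EVALUATE_BASELINE = "evaluate_baseline"
    SEARCH = "search"
    EVALUATE_BEST = "evaluate_best"
    FINALIZE = "finalize"


def promotion_report(baseline_scores: Sequence[float],
                     best_scores: Sequence[float]) -> Dict[str, object]:
    """Compare mean holdout scores; promote only on a strict improvement."""
    baseline_mean = mean(baseline_scores)
    best_mean = mean(best_scores)
    return {
        "baseline_mean": baseline_mean,
        "best_mean": best_mean,
        "delta": best_mean - baseline_mean,
        "promote": best_mean > baseline_mean,
    }


report = promotion_report(baseline_scores=[0.5, 0.5], best_scores=[0.75, 1.0])
```

A strict comparison is the simplest possible rule. A production check might also require a minimum margin so that noise on a small holdout split does not trigger a promotion.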

Tip

You can watch all four phases live in the Rollout UI — phase chip, metric-call progress, a scrolling trials feed, and a candidate list with mean/best scores. See Running & watching runs.

Where compute can run

Because the runner is just a Python process that imports your code, you can run it anywhere that can reach Rollout and your model provider:

your local terminal, for development
GitHub Actions or another CI runner
Modal, Railway, Render, Fly, ECS, or a Kubernetes worker job
a GPU host, if your target calls a local model

What the runner needs

The compute environment needs only:

rollout-cli[optimize] installed (it pulls in DSPy)
your target module and agent code on PYTHONPATH
ROLLOUT_API_KEY for the workspace
your provider keys, e.g. OPENAI_API_KEY or OPENROUTER_API_KEY
optionally ROLLOUT_BASE_URL for a local or self-hosted Rollout
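A small preflight check can catch a missing secret or an unimportable target before a run starts. This is a sketch, not part of the Rollout CLI: `check_runner_env` is a hypothetical helper, and the default list of provider key names is an assumption taken from the examples above, so adjust it to the provider your agent actually calls.

```python
import importlib.util
from typing import List, Mapping, Sequence


def check_runner_env(
    env: Mapping[str, str],
    target_module: str,
    provider_keys: Sequence[str] = ("OPENAI_API_KEY", "OPENROUTER_API_KEY"),
) -> List[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []

    if not env.get("ROLLOUT_API_KEY"):
        problems.append("ROLLOUT_API_KEY is not set")

    # Assumption: any one of the listed provider keys is enough.
    if not any(env.get(name) for name in provider_keys):
        problems.append("no provider key set (looked for: "
                        + ", ".join(provider_keys) + ")")

    # The target module must be importable, i.e. reachable via PYTHONPATH.
    try:
        spec = importlib.util.find_spec(target_module)
    except (ImportError, ValueError):
        spec = None
    if spec is None:
        problems.append(f"target module {target_module!r} is not importable")

    # ROLLOUT_BASE_URL is optional, so its absence is not a problem.
    return problems


# Example with explicit dicts; pass os.environ in a real runner.
ready = check_runner_env(
    {"ROLLOUT_API_KEY": "rk-example", "OPENAI_API_KEY": "sk-example"},
    target_module="json",  # stdlib module used as a stand-in target
)
not_ready = check_runner_env({}, target_module="no_such_module_xyz")
```

Taking the environment as a parameter rather than reading `os.environ` directly keeps the check easy to exercise in CI without touching real secrets.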

Note

After rollout optimize create, the dataset and verifier already live in Rollout — the runner no longer needs the local files. It only needs your target module and the runtime secrets above.

Next steps

Define a target — Python: Expose your agent with @rollout.optimize.
Datasets: Build a Harbor dataset and split it into train / val / holdout.
Verifiers: Score outputs with deterministic, weighted checks.
End-to-end guide: Target, dataset, verifier, run, and promotion, start to finish.