Optimize a prompt with GEPA

GEPA treats your prompt as the thing to optimize. It runs your agent on a batch of tasks, scores each output with your verifier, reads the per-task feedback, and asks a strong reflection model to propose a better prompt — keeping the candidates that score higher. Here is the shortest path from a weak prompt to a promoted one.
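The loop described above can be sketched in a few lines of Python. This is a simplified, greedy illustration and not GEPA's actual search: the names `evaluate`, `optimize`, `toy_agent`, `toy_verifier`, and `toy_reflect` are hypothetical stand-ins, and the real engine's candidate selection is more involved.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins for illustration; none of these names come from the rollout CLI.
Agent = Callable[[str, str], str]                   # (prompt, task) -> output
Verifier = Callable[[str, str], Tuple[float, str]]  # (task, output) -> (score, feedback)
Reflector = Callable[[str, List[str]], str]         # (prompt, feedback) -> new prompt


def evaluate(prompt: str, tasks: List[str], agent: Agent, verifier: Verifier):
    """Run the agent on every task; return the mean score and per-task feedback."""
    scores, feedback = [], []
    for task in tasks:
        score, note = verifier(task, agent(prompt, task))
        scores.append(score)
        feedback.append(note)
    return sum(scores) / len(scores), feedback


def optimize(baseline: str, tasks: List[str], agent: Agent,
             verifier: Verifier, reflect: Reflector, rounds: int = 3):
    """Greedy sketch: keep a proposed prompt only when it beats the current best."""
    best_prompt = baseline
    best_score, feedback = evaluate(best_prompt, tasks, agent, verifier)
    for _ in range(rounds):
        candidate = reflect(best_prompt, feedback)
        score, candidate_feedback = evaluate(candidate, tasks, agent, verifier)
        if score > best_score:
            best_prompt, best_score, feedback = candidate, score, candidate_feedback
    return best_prompt, best_score


def toy_agent(prompt: str, task: str) -> str:
    # Pretends the model only handles refunds when the prompt mentions them.
    return "refund approved" if "refund" in prompt.lower() else "thanks for your message"


def toy_verifier(task: str, output: str) -> Tuple[float, str]:
    ok = "refund" in output
    return (1.0, "ok") if ok else (0.0, "reply never addressed the refund request")


def toy_reflect(prompt: str, feedback: List[str]) -> str:
    # A real reflection model would read the feedback; this stub appends a fixed hint.
    if any("refund" in note for note in feedback):
        return prompt + " Always address refund requests."
    return prompt


best, score = optimize("You are a support agent.", ["task-1", "task-2"],
                       toy_agent, toy_verifier, toy_reflect)
print(best, score)
```

The toy functions exist only so the sketch runs end to end. In a real run your decorated target plays the agent, your verifier spec does the scoring, and the reflection model does the proposing.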

1. Install the CLI

The optimize extra pulls in the GEPA search engine. Authenticate once and the CLI remembers your workspace.

shell
pip install "rollout-cli[optimize]"
rollout login

2. Expose your agent as a target

Wrap the entry point you want to improve with @rollout.optimize. GEPA passes in a candidate — the prompt it is currently testing — and a task from your dataset. Apply candidate.text and return what your agent produced; everything else stays fixed so the score reflects the prompt alone.

target.py
import mv37.rollout as rollout
from openai import OpenAI

BASELINE = "You are a support agent. Read the message and reply."

client = OpenAI()


@rollout.optimize(id="triage-prompt", kind="prompt", baseline=BASELINE)
def run_candidate(task: rollout.Task, candidate: rollout.Candidate) -> rollout.AgentResult:
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system", "content": candidate.text},
            {"role": "user", "content": task.instruction},
        ],
    )
    return rollout.AgentResult(output=response.choices[0].message.content or "")

Pick a weak baseline

GEPA has the most room to work when the baseline is deliberately underspecified. A vague prompt on a cheap model is the ideal starting point.

3. Point it at a dataset and a verifier

Sync your targets to the workspace, then create a run from a dataset (a Harbor folder of tasks) and a verifier (a JSON spec that scores each output). See the verifier recipe and datasets for those two files.

shell
rollout optimize sync-targets
rollout optimize create \
  --target triage-prompt \
  --dataset ./support-triage \
  --verifier ./refund.json

4. Run it

Kick off the search. GEPA proposes and scores candidate prompts against your dataset until it exhausts the metric-call budget; the live run page shows the candidate list and per-task feedback as it goes.

shell
rollout optimize run triage-prompt
# GEPA · scoring candidates ............ done (32s)
# best 0.91 · baseline 0.79 · +12% on holdout
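To get a feel for how a metric-call budget bounds the search, here is a minimal sketch. It assumes one metric call means scoring one output on one task, and that every candidate is scored on the full dataset; both are assumptions for illustration, and `MetricBudget` and `run_until_exhausted` are hypothetical names, not part of the CLI.

```python
from typing import Callable, List, Tuple


class MetricBudget:
    """Counts metric calls against a fixed limit (hypothetical helper)."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0

    def remaining(self) -> int:
        return self.limit - self.used

    def spend(self, calls: int) -> bool:
        """Reserve `calls` metric calls; return False if the budget cannot cover them."""
        if calls > self.remaining():
            return False
        self.used += calls
        return True


def run_until_exhausted(
    budget: MetricBudget,
    num_tasks: int,
    propose: Callable[[int], str],
    score: Callable[[str], float],
) -> List[Tuple[str, float]]:
    """Evaluate candidates until a full pass over the dataset no longer fits."""
    evaluated: List[Tuple[str, float]] = []
    while budget.spend(num_tasks):
        candidate = propose(len(evaluated))
        evaluated.append((candidate, score(candidate)))
    return evaluated


budget = MetricBudget(limit=10)
results = run_until_exhausted(
    budget,
    num_tasks=4,
    propose=lambda i: f"candidate-{i}",  # stand-in for the reflection model
    score=lambda prompt: 0.5,            # stand-in for the verifier's mean score
)
print(len(results), budget.used, budget.remaining())
```

Under these assumptions a budget of 10 calls on a 4-task dataset covers two full candidate evaluations and leaves 2 calls unspent, so a larger dataset means fewer candidates explored for the same budget.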

5. Read the promotion report

Every run ends with a promotion report comparing the best candidate to your baseline on a held-out split. If the lift cleared your bar, the winning prompt is stored on the run — copy it straight into production. For the narrated, end-to-end version of this recipe, see the GEPA guide.
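The promotion decision comes down to comparing two held-out scores against a bar you choose. A minimal sketch follows, using the scores from the sample output above. The `PromotionReport` class, its fields, and the `min_absolute_lift` threshold are hypothetical and do not reflect the actual report format.

```python
from dataclasses import dataclass


@dataclass
class PromotionReport:
    """Hypothetical container for the two held-out scores."""

    baseline_score: float
    best_score: float

    @property
    def absolute_lift(self) -> float:
        # Difference in score points, e.g. 0.91 - 0.79.
        return self.best_score - self.baseline_score

    @property
    def relative_lift(self) -> float:
        # Improvement as a fraction of the baseline score.
        if self.baseline_score == 0:
            raise ValueError("relative lift is undefined for a zero baseline")
        return self.absolute_lift / self.baseline_score


def should_promote(report: PromotionReport, min_absolute_lift: float) -> bool:
    """Promote only if the held-out gain meets the bar you set."""
    return report.absolute_lift >= min_absolute_lift


report = PromotionReport(baseline_score=0.79, best_score=0.91)
print(round(report.absolute_lift, 2), round(report.relative_lift, 3))
print(should_promote(report, min_absolute_lift=0.05))
```

With these numbers the absolute gain is 0.12 score points, which is roughly 15% relative to the baseline. Decide up front which of the two your bar refers to, so the same run cannot pass under one reading and fail under the other.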