Guide: GEPA on your agent

The scenario

We want a weak, cheap model to triage a facility-support message into three fields (urgency, sentiment, and a list of service categories) and return clean JSON. The baseline prompt is intentionally underspecified: it names the fields but gives no output format, no allowed values, and no category list. A weak model handed that prompt produces messy, mislabelled, often non-JSON output. GEPA's job is to discover the schema, the enums, and the vocabulary on its own.

expected-output.json

{ "urgency": "high", "sentiment": "negative", "categories": ["emergency_repair_services", "quality_and_safety_concerns"] }

1. Write the target

The target applies candidate.text as the system prompt and makes a single model call through OpenRouter. Everything else is fixed, so the score reflects the prompt alone.

target.py

import os

import mv37.rollout as rollout
from openai import OpenAI

BASELINE_PROMPT = (
    "You are a support assistant for a facilities-management company. "
    "Read the customer message and report how urgent it is, the customer's "
    "sentiment, and which service categories it relates to."
)

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
    timeout=float(os.environ.get("ROLLOUT_GEPA_REQUEST_TIMEOUT", "30")),
    max_retries=1,
)

MODEL = os.environ.get("ROLLOUT_GEPA_STUDENT_MODEL", "google/gemma-3-4b-it")


@rollout.optimize(id="facility-extractor-prompt", kind="prompt", baseline=BASELINE_PROMPT)
def run_candidate(task: rollout.Task, candidate: rollout.Candidate) -> rollout.AgentResult:
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": candidate.text},
            {"role": "user", "content": task.instruction},
        ],
    )
    return rollout.AgentResult(output=response.choices[0].message.content or "")

2. Build the dataset

The dataset is a Harbor folder of triage messages, each with gold labels compiled into task_expectations — the urgency, sentiment, and categories the answer must mention, as quoted tokens. A splits.json pins which tasks are train, val, and holdout.

task.json

{
  "instruction": "The freezer in our main kitchen has completely stopped. We are losing stock by the hour.",
  "input": {},
  "metadata": {
    "expectations": {
      "mustMention": [
        { "anyOf": ["\"high\""], "message": "urgency should be high" },
        { "anyOf": ["\"negative\""], "message": "sentiment should be negative" },
        { "anyOf": ["\"emergency_repair_services\""],
          "message": "should include emergency_repair_services" }
      ]
    }
  }
}
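If your gold labels live in a spreadsheet or a list of records, a small script can compile them into the task.json shape shown above. The following is a minimal sketch: `build_task` and `quoted` are hypothetical helper names, not part of the rollout library, and the only structure assumed is the one visible in the example task.

```python
import json


def quoted(value: str) -> str:
    """Wrap a label in JSON double quotes, e.g. high -> "high"."""
    return json.dumps(value)


def build_task(instruction: str, urgency: str, sentiment: str, categories: list[str]) -> dict:
    """Compile gold labels into a task dict (hypothetical helper, not a rollout API)."""
    must_mention = [
        {"anyOf": [quoted(urgency)], "message": f"urgency should be {urgency}"},
        {"anyOf": [quoted(sentiment)], "message": f"sentiment should be {sentiment}"},
    ]
    # One expectation per gold category, so each one must appear in the answer.
    must_mention += [
        {"anyOf": [quoted(c)], "message": f"should include {c}"} for c in categories
    ]
    return {
        "instruction": instruction,
        "input": {},
        "metadata": {"expectations": {"mustMention": must_mention}},
    }


task = build_task(
    "The freezer in our main kitchen has completely stopped. We are losing stock by the hour.",
    urgency="high",
    sentiment="negative",
    categories=["emergency_repair_services"],
)
print(json.dumps(task, indent=2))
```

Generating the tasks from one function keeps the quoting consistent across the whole dataset, which matters because a single unquoted token would quietly reward prose answers.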

Tip

Quoting the tokens ("high" with the JSON quotes) means the expectation only matches when the value appears as a JSON string, so prose that merely uses the word "high" gets no credit. That nudges GEPA toward real JSON output.
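The effect of the quotes is easy to see in isolation. The sketch below assumes the expectation is a plain substring match over the model output; that is an assumption about how task_expectations behaves, and `mentions` is a hypothetical stand-in rather than the real check.

```python
import json


def mentions(output: str, any_of: list[str]) -> bool:
    """Assumed semantics: pass if any token occurs as a substring of the output."""
    return any(token in output for token in any_of)


expectation = ['"high"']  # the token includes the JSON double quotes

json_answer = json.dumps({"urgency": "high", "sentiment": "negative", "categories": []})
prose_answer = "The urgency here is high and the customer sounds negative."

print(mentions(json_answer, expectation))   # True
print(mentions(prose_answer, expectation))  # False
```

Under this assumption, prose that happens to put the word in double quotes would still match, which is one reason the verifier in the next step also checks for valid JSON and required keys.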

3. Write the verifier

Three deterministic checks, weighted so labels matter most, then valid JSON, then the required keys. No LLM judge — see Verifiers for the full semantics.

verifier.json

{
  "id": "facility-extraction-quality",
  "name": "Facility extraction quality",
  "kind": "native",
  "checks": [
    { "id": "field-labels", "type": "task_expectations", "weight": 4, "params": {} },
    { "id": "valid-json", "type": "json_valid", "weight": 2, "params": {} },
    { "id": "required-keys", "type": "json_keys", "weight": 1,
      "params": { "requiredKeys": ["urgency", "sentiment", "categories"] } }
  ]
}
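To build intuition for how the 4 / 2 / 1 weights interact, here is a rough local approximation of the three checks. It assumes the final score is a weighted average of per-check scores in [0, 1] and that the label check is a substring match; both are assumptions, and the Verifiers page remains the reference for the actual semantics. All function names are hypothetical.

```python
import json

WEIGHTS = {"field-labels": 4, "valid-json": 2, "required-keys": 1}
REQUIRED_KEYS = ["urgency", "sentiment", "categories"]


def check_scores(output: str, must_mention: list[str]) -> dict[str, float]:
    """Approximate each check locally; the real checks may behave differently."""
    # Fraction of expected quoted tokens found in the raw output.
    labels = sum(token in output for token in must_mention) / len(must_mention)
    try:
        parsed = json.loads(output)
        valid = 1.0
    except json.JSONDecodeError:
        parsed, valid = None, 0.0
    has_keys = isinstance(parsed, dict) and all(k in parsed for k in REQUIRED_KEYS)
    return {"field-labels": labels, "valid-json": valid, "required-keys": float(has_keys)}


def weighted_score(scores: dict[str, float]) -> float:
    """Assumed aggregation: weighted average, normalised by the total weight (7)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / sum(WEIGHTS.values())


MUST = ['"high"', '"negative"', '"emergency_repair_services"']

good = '{"urgency": "high", "sentiment": "negative", "categories": ["emergency_repair_services"]}'
wrong_labels = '{"urgency": "low", "sentiment": "neutral", "categories": []}'
prose = "This is urgent and the customer is unhappy."

for name, answer in [("good", good), ("wrong_labels", wrong_labels), ("prose", prose)]:
    print(name, round(weighted_score(check_scores(answer, MUST)), 3))
```

In this approximation, a well-formed JSON answer with every label wrong scores 3/7, so getting the structure right is worth less than half the total and the labels carry the rest.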

4. Create and run

Set your keys, then a single optimize run registers the target, uploads the dataset, creates the verifier, applies the splits, creates the run, and starts GEPA:

shell

export ROLLOUT_API_KEY=rl_...
export OPENROUTER_API_KEY=sk-or-...

rollout optimize run \
  --module app.targets \
  --target facility-extractor-prompt \
  --dataset ./dataset \
  --verifier ./verifier.json \
  --splits-file ./splits.json \
  --reflection-model openrouter/openai/gpt-4.1 \
  --auto light

Note

The target module and your agent code must be importable by the runner process — put them on PYTHONPATH. For a first cheap smoke test of the whole pipeline, add --max-metric-calls 2.
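Before spending any budget, it can help to confirm that the module passed to --module resolves from the environment the runner will use. This is a minimal sketch using only the standard library; `is_importable` is a hypothetical helper, and `app.targets` is simply the module name from the command above.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if the module can be located on the current sys.path."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. "app" in "app.targets") is missing.
        return False


for name in ("app.targets", "openai"):
    status = "ok" if is_importable(name) else "NOT importable"
    print(f"{name}: {status}")
```

Run it from the same shell, with the same PYTHONPATH, that you will use for the optimize run. Note that looking up a dotted name imports its parent package, so any side effects in the package's `__init__` will run.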

5. Watch it live

Open the run in the Rollout UI. The phase chip walks from evaluating baseline → searching → evaluating best → finalizing; the metric-call bar shows budget spent; the trials feed scrolls each candidate × task with its score and the verifier's feedback; and the candidate list ranks prompts by mean and best score as the search proposes new ones. On a weak student you will see the mean score climb from near 0.1 (prose, no JSON) toward 0.5+ as GEPA discovers the schema. See Running & watching runs for every element on the page.

6. Read the result

When the run finishes you get the canonical candidates and a promotion report comparing the best candidate to the baseline on the holdout split. The winning prompt — now spelling out the JSON schema, the urgency/sentiment enums, and the category vocabulary the baseline never mentioned — is stored on the run, ready to paste back into BASELINE_PROMPT. That is the whole loop: a vague prompt on a weak model becomes a precise one, discovered automatically from your dataset and verifier.

Swapping models

Faster demo: keep the student small — google/gemma-3-4b-it returns in about a second per call, so the sequential stages stay quick.
Stronger student: set ROLLOUT_GEPA_STUDENT_MODEL to a larger model. The baseline starts higher and the before/after gap shrinks: a "good model gets even better" story rather than a dramatic rescue.
Reflection model: remember the OpenRouter prefix — openrouter/openai/gpt-4.1, not openai/gpt-4.1 — so litellm uses your OpenRouter key.