Rollout documentation

The verifier spec

A native verifier is a JSON file with an ordered list of checks. Each check has a type, an optional weight (default 1), and params:

refund-quality.json

{
  "id": "refund-policy-quality",
  "name": "Refund policy quality",
  "kind": "native",
  "passThreshold": 1.0,
  "checks": [
    {
      "id": "policy-expectations",
      "type": "task_expectations",
      "weight": 1,
      "params": {}
    }
  ]
}

How scoring works

The score for one output is built in two steps:

Within a verifier: every check returns 0, 1, or a fraction (partial credit). The verifier score is the weighted mean of its checks, so a check with weight: 4 counts four times as much as a weight: 1 check.
Across multiple verifiers: the final score is the plain (unweighted) mean of the verifier scores.
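
The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the function names are hypothetical, and returning 0.0 for an empty input is an assumption the text does not cover.

```python
from typing import Sequence, Tuple


def verifier_score(checks: Sequence[Tuple[float, float]]) -> float:
    """Weighted mean of (score, weight) pairs for one verifier."""
    total_weight = sum(weight for _, weight in checks)
    if total_weight == 0:
        return 0.0  # assumption: a verifier with no weighted checks scores 0
    return sum(score * weight for score, weight in checks) / total_weight


def final_score(verifier_scores: Sequence[float]) -> float:
    """Plain (unweighted) mean across verifiers."""
    if not verifier_scores:
        return 0.0  # assumption: no verifiers means a score of 0
    return sum(verifier_scores) / len(verifier_scores)


# Verifier 1: the weight-4 check fails, the weight-1 check passes.
v1 = verifier_score([(0.0, 4), (1.0, 1)])  # (0*4 + 1*1) / 5
# Verifier 2: a single passing check.
v2 = verifier_score([(1.0, 1)])
overall = final_score([v1, v2])
print(round(v1, 2), round(v2, 2), round(overall, 2))  # 0.2 1.0 0.6
```

Note how the failed weight-4 check pulls the first verifier down to 0.2 even though one of its two checks passed.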

passThreshold (default 1.0) sets the score at or above which the output counts as passed. Marking a check "required": true makes any failure of it critical — the output cannot pass regardless of the numeric score. Either way, GEPA optimizes against the continuous score, and the failure reasons from every sub-perfect check are concatenated into the feedback it reads.
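
As a rough sketch of the pass decision described above, assuming the verifier score has already been computed: the names CheckOutcome and judge are hypothetical, and treating any score below 1.0 on a required check as a "failure" is an assumption about how the tool defines it.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CheckOutcome:
    score: float            # 0.0 to 1.0
    required: bool = False  # mirrors "required": true in the spec
    reason: str = ""        # failure reason, empty when the check passed


def judge(score: float, outcomes: List[CheckOutcome],
          pass_threshold: float = 1.0) -> Tuple[bool, str]:
    """Return (passed, feedback) for one output."""
    # Assumption: a required check counts as failed when it scores below 1.0.
    critical = any(o.required and o.score < 1.0 for o in outcomes)
    passed = score >= pass_threshold and not critical
    # Reasons from every sub-perfect check are concatenated into the feedback.
    feedback = "\n".join(o.reason for o in outcomes if o.score < 1.0 and o.reason)
    return passed, feedback


outcomes = [
    CheckOutcome(score=1.0),
    CheckOutcome(score=0.0, required=True, reason="output is not valid JSON"),
]
# The numeric score clears the threshold, but the required check failed.
passed, feedback = judge(0.8, outcomes, pass_threshold=0.8)
print(passed, feedback)  # False output is not valid JSON
```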

Tip

Weights are your steering wheel. In the structured-extraction example, the field-label check carries weight: 4, valid-JSON weight: 2, and required-keys weight: 1 — so the optimizer cares most about getting the labels right, then producing JSON, then including the keys.

Check types

The native verifier supports these check types for local, deterministic scoring:

type	params	Passes when
task_expectations	—	Output satisfies the per-task mustMention / mustNotMention expectations. Graded (partial credit).
contains	`value`, `caseSensitive`?	Output contains the substring (case-insensitive by default). Alias: must_contain.
not_contains	`value`, `caseSensitive`?	Output does not contain the substring. Alias: must_not_contain.
regex	`pattern`	Output matches the regular expression.
equals	`value`, `caseSensitive`?	Trimmed output equals the value. Alias: exact_match.
min_length	`value`	Output is at least value characters long.
max_length	`value`	Output is at most value characters long.
json_valid	—	Output parses as JSON (strict json.loads). This is the check weak models fail most often.
json_keys	`requiredKeys`	Output is a JSON object containing every required key.
expected_output_schema	—	Output is a JSON object with the required keys from the task's expected output schema.
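
To make the table concrete, here is a hedged sketch of how four of these checks might behave. The function names are hypothetical, and using re.search (a match anywhere in the output) for the regex check is an assumption; the table does not say whether the pattern must match the whole output.

```python
import json
import re


def check_contains(output: str, value: str, case_sensitive: bool = False) -> bool:
    """Substring test, case-insensitive unless case_sensitive is set."""
    if not case_sensitive:
        output, value = output.lower(), value.lower()
    return value in output


def check_regex(output: str, pattern: str) -> bool:
    """Assumption: the pattern may match anywhere in the output."""
    return re.search(pattern, output) is not None


def check_json_valid(output: str) -> bool:
    """True when the output parses with json.loads."""
    try:
        json.loads(output)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


def check_json_keys(output: str, required_keys: list) -> bool:
    """True when the output is a JSON object containing every required key."""
    try:
        data = json.loads(output)
    except ValueError:
        return False
    return isinstance(data, dict) and all(key in data for key in required_keys)


print(check_contains("Full Refund issued", "refund"))            # True
print(check_json_valid("Sure! Here is the JSON you asked for"))  # False
print(check_json_keys('{"urgency": "high"}', ["urgency", "sentiment"]))  # False
```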

Note

An unknown check type is skipped (it does not fail the output) and noted in the feedback. A check that raises is treated as a failure with the exception in the reason, so one broken check can never crash the run.
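
The behaviour in this note could look roughly like the loop below. It is a sketch under stated assumptions: run_checks, the registry mapping, and the exact wording of the notes are all hypothetical, not the tool's real internals.

```python
from typing import Callable, Dict, List, Tuple

CheckFn = Callable[[str, dict], float]


def run_checks(output: str, checks: List[dict],
               registry: Dict[str, CheckFn]) -> Tuple[list, list]:
    """Run each check; skip unknown types, turn exceptions into failures."""
    results = []  # (check id, score, weight)
    notes = []    # messages that would be surfaced in the feedback
    for check in checks:
        fn = registry.get(check["type"])
        if fn is None:
            # Unknown type: skipped, so it does not count against the output.
            notes.append(f"{check['id']}: unknown check type {check['type']!r}, skipped")
            continue
        try:
            score = float(fn(output, check.get("params", {})))
        except Exception as exc:
            # A check that raises is scored as a failure; the run continues.
            score = 0.0
            notes.append(f"{check['id']}: check raised {type(exc).__name__}: {exc}")
        results.append((check["id"], score, check.get("weight", 1)))
    return results, notes


registry = {"min_length": lambda out, params: float(len(out) >= params["value"])}
checks = [
    {"id": "long-enough", "type": "min_length", "params": {"value": 5}},
    {"id": "mystery", "type": "no_such_type"},
    {"id": "broken", "type": "min_length", "params": {}},  # missing "value"
]
results, notes = run_checks("hello world", checks, registry)
print(results)  # [('long-enough', 1.0, 1), ('broken', 0.0, 1)]
```

The unknown check never reaches the results list, while the broken one is recorded with a score of 0 and a note carrying the exception.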

task_expectations in depth

This is the workhorse check. It reads expectations from each task's metadata and grades the output phrase by phrase, giving partial credit and a concrete message per miss:

task-metadata.json

{
  "expectations": {
    "mustMention": [
      { "anyOf": ["refund", "money back"],
        "message": "the answer should offer a refund" }
    ],
    "mustNotMention": [
      { "anyOf": ["gift card"],
        "message": "do not push a gift card instead of a refund" }
    ]
  }
}

Each entry uses anyOf (a list of acceptable phrases) or a single text phrase. Matching is case-insensitive substring matching.
The score is (total − failures) / total across all mustMention and mustNotMention entries — so missing one of four expectations scores 0.75.
Each entry's message is the failure reason handed to GEPA, so write it as the instruction you wish the prompt had followed.
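
The grading rules above can be sketched as follows. The function name grade_expectations is hypothetical, and two details are assumptions: an empty expectations object scores 1.0, and a mustNotMention entry fails as soon as any of its phrases appears.

```python
from typing import List, Tuple


def grade_expectations(output: str, expectations: dict) -> Tuple[float, List[str]]:
    """Score = (total - failures) / total over all expectation entries."""
    text = output.lower()  # matching is case-insensitive substring matching

    def phrases(entry: dict) -> List[str]:
        # An entry carries either anyOf (a list) or a single text phrase.
        return entry["anyOf"] if "anyOf" in entry else [entry["text"]]

    def mentioned(entry: dict) -> bool:
        return any(p.lower() in text for p in phrases(entry))

    total = 0
    failures = []
    for entry in expectations.get("mustMention", []):
        total += 1
        if not mentioned(entry):
            failures.append(entry.get("message", "missing expected phrase"))
    for entry in expectations.get("mustNotMention", []):
        total += 1
        if mentioned(entry):
            failures.append(entry.get("message", "contains forbidden phrase"))

    score = (total - len(failures)) / total if total else 1.0  # assumption
    return score, failures


expectations = {
    "mustMention": [{"anyOf": ["refund", "money back"],
                     "message": "the answer should offer a refund"}],
    "mustNotMention": [{"anyOf": ["gift card"],
                        "message": "do not push a gift card instead of a refund"}],
}
score, failures = grade_expectations(
    "We will refund you, or you can take a gift card.", expectations)
print(score, failures)  # 0.5 ['do not push a gift card instead of a refund']
```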

A worked example

The structured-extraction verifier combines three checks at different weights to grade a triage output (urgency / sentiment / categories):

verifier.json

{
  "id": "facility-extraction-quality",
  "name": "Facility extraction quality",
  "kind": "native",
  "checks": [
    { "id": "field-labels",  "type": "task_expectations", "weight": 4, "params": {} },
    { "id": "valid-json",    "type": "json_valid",        "weight": 2, "params": {} },
    { "id": "required-keys", "type": "json_keys",         "weight": 1,
      "params": { "requiredKeys": ["urgency", "sentiment", "categories"] } }
  ]
}

A clean, correct answer scores 1.0; a partially-right one lands around 0.7–0.8; freeform prose that is not JSON scores near 0.1. That spread is what lets the live candidate scores separate as GEPA improves the prompt.
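
The arithmetic behind that spread can be reproduced with the weights from the spec (4, 2, 1). The per-check scores below are hypothetical inputs chosen for illustration, not measured results.

```python
WEIGHTS = {"field-labels": 4, "valid-json": 2, "required-keys": 1}


def weighted_mean(scores: dict, weights: dict) -> float:
    """Weighted mean of per-check scores, keyed by check id."""
    return sum(scores[k] * weights[k] for k in weights) / sum(weights.values())


# Hypothetical per-check scores for three kinds of output.
clean = weighted_mean(
    {"field-labels": 1.0, "valid-json": 1.0, "required-keys": 1.0}, WEIGHTS)
partial = weighted_mean(  # valid JSON with all keys, half the labels right
    {"field-labels": 0.5, "valid-json": 1.0, "required-keys": 1.0}, WEIGHTS)
prose = weighted_mean(    # not JSON at all, a quarter of the labels mentioned
    {"field-labels": 0.25, "valid-json": 0.0, "required-keys": 0.0}, WEIGHTS)

print(round(clean, 2), round(partial, 2), round(prose, 2))  # 1.0 0.71 0.14
```

Because the label check carries four of the seven weight units, prose that fails both JSON checks can earn at most 4/7 of the score, and only by getting every label right.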

Creating a verifier

Point optimize create or optimize run at the spec file, and the verifier is created (or reused) automatically. You can also create it explicitly:

shell

# created inline when you reference the file
rollout optimize run --verifier ./verifiers/refund-quality.json ...

# or create it on its own
rollout verifiers create ./verifiers/refund-quality.json

# reference an existing verifier by id afterwards
rollout optimize run --verifier refund-policy-quality ...

Next: run it and watch it live, or follow the full end-to-end guide.