The verifier spec
A native verifier is a JSON file with an ordered list of checks. Each check has a type, an optional weight (default 1), and params:
```json
{
  "id": "refund-policy-quality",
  "name": "Refund policy quality",
  "kind": "native",
  "passThreshold": 1.0,
  "checks": [
    {
      "id": "policy-expectations",
      "type": "task_expectations",
      "weight": 1,
      "params": {}
    }
  ]
}
```

How scoring works
The score for one output is built in two steps:
- Within a verifier: every check returns 0, 1, or a fraction (partial credit). The verifier score is the weighted mean of its checks, so a check with weight: 4 counts four times as much as a weight: 1 check.
- Across multiple verifiers: the final score is the plain mean of each verifier's score.
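The two steps above can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not the tool's actual code: the function names and the `(score, weight)` pair shape are hypothetical, and scoring a verifier with no weight as 0 is an assumption.

```python
from statistics import fmean


def verifier_score(checks):
    """Weighted mean of check scores.

    `checks` is a list of (score, weight) pairs, each score in [0, 1].
    (Hypothetical in-memory shape, chosen for illustration.)
    """
    total_weight = sum(weight for _, weight in checks)
    if total_weight == 0:
        return 0.0  # assumption: a verifier with no weighted checks scores 0
    return sum(score * weight for score, weight in checks) / total_weight


def final_score(verifier_scores):
    """Plain (unweighted) mean across verifiers."""
    return fmean(verifier_scores)


# A weight-4 check that passes and a weight-1 check that fails.
print(verifier_score([(1.0, 4), (0.0, 1)]))  # 0.8
# Two verifiers scoring 0.5 and 1.0.
print(final_score([0.5, 1.0]))  # 0.75
```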
passThreshold (default 1.0) sets the score at or above which the output counts as passed. Marking a check "required": true makes any failure of it critical — the output cannot pass regardless of the numeric score. Either way, GEPA optimizes against the continuous score, and the failure reasons from every sub-perfect check are concatenated into the feedback it reads.
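A minimal sketch of that pass decision, under stated assumptions: the `check_results` shape is hypothetical, and a check is treated as failed whenever its score is below 1.0, which the text above does not spell out.

```python
def output_passes(score, check_results, pass_threshold=1.0):
    """Decide pass/fail for one output.

    `check_results` is a list of dicts with a numeric "score" and an optional
    boolean "required" (hypothetical shape, for illustration only).
    """
    # Assumption: a check counts as failed when its score is below 1.0.
    required_failed = any(
        result.get("required", False) and result["score"] < 1.0
        for result in check_results
    )
    # A failed required check blocks the pass regardless of the numeric score.
    return (not required_failed) and score >= pass_threshold


# Numeric score clears the threshold, but a required check failed.
print(output_passes(0.9, [{"score": 0.0, "required": True}, {"score": 1.0}], 0.5))  # False
```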
Tip
Weights are your steering wheel. In the structured-extraction example, the field-label check carries weight: 4, valid-JSON weight: 2, and required-keys weight: 1 — so the optimizer cares most about getting the labels right, then producing JSON, then including the keys.
Check types
The native verifier supports these check types for local, deterministic scoring:
| type | params | Passes when |
|---|---|---|
| task_expectations | — | Output satisfies the per-task mustMention / mustNotMention expectations. Graded (partial credit). |
| contains | value, caseSensitive? | Output contains the substring (case-insensitive by default). Alias: must_contain. |
| not_contains | value, caseSensitive? | Output does not contain the substring. Alias: must_not_contain. |
| regex | pattern | Output matches the regular expression. |
| equals | value, caseSensitive? | Trimmed output equals the value. Alias: exact_match. |
| min_length | value | Output is at least value characters long. |
| max_length | value | Output is at most value characters long. |
| json_valid | — | Output parses as JSON (strict json.loads). The thing weak models fail most. |
| json_keys | requiredKeys | Output is a JSON object containing every required key. |
| expected_output_schema | — | Output is a JSON object with the required keys from the task's expected output schema. |
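To make the table concrete, here is a rough Python sketch of how a few of these checks could behave. The function names are hypothetical, and using `re.search` (a match anywhere in the output) for the regex check is an assumption; the table does not say whether the pattern must match the whole output.

```python
import json
import re


def contains(output, value, case_sensitive=False):
    """Substring check, case-insensitive by default."""
    if not case_sensitive:
        output, value = output.lower(), value.lower()
    return value in output


def matches_regex(output, pattern):
    """Assumption: a match anywhere in the output counts."""
    return re.search(pattern, output) is not None


def json_valid(output):
    """True when the output parses with strict json.loads."""
    try:
        json.loads(output)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


def json_keys(output, required_keys):
    """True when the output is a JSON object holding every required key."""
    try:
        parsed = json.loads(output)
    except ValueError:
        return False
    return isinstance(parsed, dict) and all(key in parsed for key in required_keys)


print(json_valid('{"urgency": "high"}'))                            # True
print(json_keys('{"urgency": "high"}', ["urgency", "sentiment"]))   # False
```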
Note
An unknown check type is skipped (it does not fail the output) and noted in the feedback. A check that raises is treated as a failure with the exception in the reason, so one broken check can never crash the run.
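The behavior in the note could look something like the runner below. It is a sketch, not the real implementation: `run_checks`, the `registry` mapping, and the result tuple are hypothetical, and leaving a skipped check out of the results (so it does not affect the weighted mean) is an assumption.

```python
def run_checks(output, checks, registry):
    """Run every check without letting one broken check stop the rest.

    `registry` maps a check type to a callable(output, params) that returns a
    bool or a score in [0, 1]. (Hypothetical structure, for illustration.)
    """
    results, feedback = [], []
    for check in checks:
        check_fn = registry.get(check["type"])
        if check_fn is None:
            # Unknown type: skip it and note that in the feedback.
            # Assumption: a skipped check is left out of the results entirely.
            feedback.append(f"{check['id']}: unknown check type {check['type']!r}, skipped")
            continue
        try:
            score = float(check_fn(output, check.get("params", {})))
        except Exception as exc:
            # A check that raises is a failure, with the exception as the reason.
            score = 0.0
            feedback.append(f"{check['id']}: check raised {exc!r}")
        results.append((check["id"], score, check.get("weight", 1)))
    return results, feedback
```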
task_expectations in depth
This is the workhorse check. It reads expectations from each task's metadata and grades the output phrase by phrase, giving partial credit and a concrete message per miss:
```json
{
  "expectations": {
    "mustMention": [
      { "anyOf": ["refund", "money back"], "message": "the answer should offer a refund" }
    ],
    "mustNotMention": [
      { "anyOf": ["gift card"], "message": "do not push a gift card instead of a refund" }
    ]
  }
}
```

- Each entry uses anyOf (a list of acceptable phrases) or a single text phrase. Matching is case-insensitive substring matching.
- The score is (total − failures) / total across all mustMention and mustNotMention entries, so missing one of four expectations scores 0.75.
- Each entry's message is the failure reason handed to GEPA, so write it as the instruction you wish the prompt had followed.
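The grading rule described above can be approximated as follows. This is a sketch under assumptions: the function names are hypothetical, a task with no expectations is scored 1.0, and a missing message falls back to a generic reason.

```python
def entry_phrases(entry):
    """An entry carries either "anyOf" (a list) or a single "text" phrase."""
    return entry["anyOf"] if "anyOf" in entry else [entry["text"]]


def grade_expectations(output, expectations):
    """Return (score, failure_messages) for one output."""
    text = output.lower()  # case-insensitive substring matching
    must = expectations.get("mustMention", [])
    must_not = expectations.get("mustNotMention", [])
    failures = []
    for entry in must:
        # Fails when none of the acceptable phrases appears.
        if not any(phrase.lower() in text for phrase in entry_phrases(entry)):
            failures.append(entry.get("message", "missing expected phrase"))
    for entry in must_not:
        # Fails when any forbidden phrase appears.
        if any(phrase.lower() in text for phrase in entry_phrases(entry)):
            failures.append(entry.get("message", "contains forbidden phrase"))
    total = len(must) + len(must_not)
    # Assumption: with no expectations at all, the check scores 1.0.
    score = 1.0 if total == 0 else (total - len(failures)) / total
    return score, failures


expectations = {
    "mustMention": [
        {"anyOf": ["refund", "money back"], "message": "the answer should offer a refund"}
    ],
    "mustNotMention": [
        {"anyOf": ["gift card"], "message": "do not push a gift card instead of a refund"}
    ],
}
print(grade_expectations("You can have a refund or a gift card.", expectations))
# (0.5, ['do not push a gift card instead of a refund'])
```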
A worked example
The structured-extraction verifier combines three checks at different weights to grade a triage output (urgency / sentiment / categories):
```json
{
  "id": "facility-extraction-quality",
  "name": "Facility extraction quality",
  "kind": "native",
  "checks": [
    { "id": "field-labels", "type": "task_expectations", "weight": 4, "params": {} },
    { "id": "valid-json", "type": "json_valid", "weight": 2, "params": {} },
    {
      "id": "required-keys",
      "type": "json_keys",
      "weight": 1,
      "params": { "requiredKeys": ["urgency", "sentiment", "categories"] }
    }
  ]
}
```

A clean, correct answer scores 1.0; a partially-right one lands around 0.7–0.8; freeform prose that is not JSON scores near 0.1. That spread is what lets the live candidate scores separate as GEPA improves the prompt.
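To see where figures like these come from, the arithmetic can be worked through with the 4 / 2 / 1 weights from this spec. The per-check scores below are made up for illustration (they are not measured results), and `weighted` is a hypothetical helper.

```python
# Weights taken from the spec above.
weights = {"field-labels": 4, "valid-json": 2, "required-keys": 1}


def weighted(scores):
    """Weighted mean of per-check scores using the weights above."""
    return sum(scores[name] * w for name, w in weights.items()) / sum(weights.values())


# Hypothetical per-check scores for three kinds of output.
clean = weighted({"field-labels": 1.0, "valid-json": 1.0, "required-keys": 1.0})
partial = weighted({"field-labels": 0.5, "valid-json": 1.0, "required-keys": 1.0})
prose = weighted({"field-labels": 0.25, "valid-json": 0.0, "required-keys": 0.0})

print(round(clean, 2), round(partial, 2), round(prose, 2))  # 1.0 0.71 0.14
```

With these assumed inputs, valid JSON with half the labels right lands at 5/7, while prose that happens to mention a quarter of the expected labels gets only 1/7, because the two JSON checks contribute nothing.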
Creating a verifier
Point optimize create/run at the spec file and the verifier is created (or reused) automatically, or create it explicitly:
```bash
# created inline when you reference the file
rollout optimize run --verifier ./verifiers/refund-quality.json ...

# or create it on its own
rollout verifiers create ./verifiers/refund-quality.json

# reference an existing verifier by id afterwards
rollout optimize run --verifier refund-policy-quality ...
```

Next: run it and watch it live, or follow the full end-to-end guide.