
Run a benchmark suite

Benchmarks are shared task packages — a dataset and verifier bundled together — that you can pull and run against your agent to get a comparable pass rate. Use them to sanity-check a change before you ship it.

1. Find a benchmark

List the benchmark packages available to your workspace.

shell
rollout login
rollout benchmarks list

2. Pull it into your project

Pulling a benchmark drops its dataset and verifier into your project so you can inspect and version them alongside your code.

shell
rollout benchmarks pull support-triage

3. Run it against your agent

Start a run pointed at your agent endpoint. Each task is scored by the benchmark's verifier, and you get a pass rate plus a trace for every failure.

shell
rollout benchmarks run support-triage
# scoring 1,000 tasks ............ done
# pass_rate 0.91 · 912/1000 verified
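
If you want the pass rate to gate a change before you ship it, one option is to save the run output and compare the reported rate against a threshold. The sketch below is a minimal example, not part of the CLI. The helper names `pass_rate_from_output` and `meets_threshold` are hypothetical, and the sketch assumes the summary line keeps the `pass_rate <value>` shape shown in the sample output above, which may differ in your version.

```shell
# Hypothetical helper: print the value that follows the "pass_rate" token
# in a saved run log. Assumes a summary line shaped like the sample output:
#   # pass_rate 0.91 · 912/1000 verified
pass_rate_from_output() {
  awk '{
    for (i = 1; i < NF; i++) {
      if ($i == "pass_rate") { print $(i + 1); exit }
    }
  }' "$1"
}

# Hypothetical gate: exit status 0 when rate >= minimum, 1 otherwise.
# awk does the comparison because POSIX sh has no floating-point arithmetic.
meets_threshold() {
  awk -v rate="$1" -v min="$2" 'BEGIN { exit !(rate + 0 >= min + 0) }'
}
```

A possible use, assuming the run prints its summary to standard output: capture it with `rollout benchmarks run support-triage | tee run.log`, then run `meets_threshold "$(pass_rate_from_output run.log)" 0.90 || exit 1` in your pre-ship script. The `0.90` threshold is only an illustration; pick a bar that suits your agent.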

Flags & endpoints

For the exact run flags and how to point a benchmark at a local or hosted endpoint, see the CLI benchmarks reference.