1. Find a benchmark
List the benchmark packages available to your workspace.
shell
rollout login
rollout benchmarks list

2. Pull it into your project
Pulling a benchmark drops its dataset and verifier into your project so you can inspect and version them alongside your code.
shell
rollout benchmarks pull support-triage

3. Run it against your agent
Start a run pointed at your agent endpoint. Each task is scored by the benchmark's verifier, and you get a pass rate plus a trace for every failure.
shell
rollout benchmarks run support-triage
# scoring 1,000 tasks ............ done
# pass_rate 0.91 · 912/1000 verified

Flags & endpoints
For the exact run flags and how to point a benchmark at a local or hosted endpoint, see the CLI benchmarks reference.
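If you want to act on the pass rate from step 3 in a script (for example, to fail a CI job when it drops), one option is to parse the summary line. The sketch below is not part of the rollout CLI. The helper names `extract_pass_rate` and `meets_threshold` are hypothetical, and it assumes the run prints a summary line shaped like `# pass_rate 0.91 · 912/1000 verified` to standard output, as in the sample above.

```shell
# Hypothetical helper (not a rollout command): read run output on stdin and
# print the value that follows "# pass_rate". Exits non-zero if no such line
# is found. Assumes the summary format shown in the step 3 sample output.
extract_pass_rate() {
  awk '$1 == "#" && $2 == "pass_rate" { print $3; found = 1 }
       END { exit (found ? 0 : 1) }'
}

# Hypothetical helper: succeed only when the rate ($1) is at least the
# minimum ($2). Both arguments are compared as numbers.
meets_threshold() {
  awk -v rate="$1" -v min="$2" 'BEGIN { exit ((rate + 0 >= min + 0) ? 0 : 1) }'
}

# Example use, assuming the summary is written to stdout:
#   rate=$(rollout benchmarks run support-triage | extract_pass_rate) \
#     && meets_threshold "$rate" 0.90
```

Matching on the `pass_rate` label rather than a line position keeps the parse from depending on how many progress lines come before the summary. If the CLI offers a machine-readable output mode, prefer that; check the CLI benchmarks reference before relying on the text format.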