CLI

CLI — benchmarks

Benchmark packages are standardized task suites you run against your own agent endpoint. Rollout drives the run — creating attempts, POSTing each task, and persisting the score your endpoint returns.

List & pull

List the runnable benchmark packages, then add one to the active workspace:

shell
rollout benchmarks listrollout benchmarks pull terminal-bench

Run against an endpoint

Start a hosted run against your agent endpoint. Rollout sends each task payload to the URL and records the output plus any score or passed value the endpoint returns:

shell
rollout benchmarks run terminal-bench \  --agent-endpoint https://agent.example.com/rollout/eval \  --header "Authorization: Bearer $AGENT_RUN_TOKEN" \  --max-tasks 10

Current behavior

What runs today

Hosted endpoint runs are implemented: Rollout creates attempts, POSTs each task payload to your endpoint, and persists the output plus endpoint-returned score or passed values as evaluations.

--agent-command is accepted by the CLI/API shape but is not executed by the backend yet, and Rollout does not run heavy benchmark verifier harnesses yet.