Rollout documentation

List & pull

List the runnable benchmark packages, then add one to the active workspace:

shell

rollout benchmarks list
rollout benchmarks pull terminal-bench

Run against an endpoint

Start a hosted run against your agent endpoint. Rollout sends each task payload to the URL and records the output plus any score or passed value the endpoint returns:

shell

rollout benchmarks run terminal-bench \
  --agent-endpoint https://agent.example.com/rollout/eval \
  --header "Authorization: Bearer $AGENT_RUN_TOKEN" \
  --max-tasks 10
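
For orientation, the receiving side of such a run might look like the minimal Python sketch below. It is not Rollout's reference implementation. The task payload schema is not documented here, so the sketch treats it as opaque JSON. The response field names output, score, and passed are assumptions inferred from the values Rollout is described as recording. The names handle_task, is_authorized, and EvalHandler, and port 8080, are hypothetical.

```python
import hmac
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Any, Optional


def is_authorized(header_value: Optional[str], expected_token: Optional[str]) -> bool:
    """Check an 'Authorization: Bearer <token>' header against the expected token."""
    if not expected_token:
        return True  # assumption: no token configured means auth is disabled
    if not header_value:
        return False
    expected = f"Bearer {expected_token}".encode("utf-8")
    return hmac.compare_digest(header_value.encode("utf-8"), expected)


def handle_task(task: Any) -> dict:
    """Placeholder agent logic: echo the task back and report success.

    The keys 'output', 'score', and 'passed' are assumed field names.
    """
    return {
        "output": "echo: " + json.dumps(task, sort_keys=True),
        "score": 1.0,
        "passed": True,
    }


class EvalHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        token = os.environ.get("AGENT_RUN_TOKEN")
        if not is_authorized(self.headers.get("Authorization"), token):
            self.send_error(401, "missing or invalid bearer token")
            return

        # Read and parse the JSON task payload sent in the request body.
        length = int(self.headers.get("Content-Length", 0))
        raw = self.rfile.read(length)
        try:
            task = json.loads(raw or b"{}")
        except json.JSONDecodeError:
            self.send_error(400, "request body is not valid JSON")
            return

        body = json.dumps(handle_task(task)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Serve locally; expose this behind HTTPS before using it as --agent-endpoint.
    HTTPServer(("127.0.0.1", 8080), EvalHandler).serve_forever()
```

Keeping the agent logic in a plain function such as handle_task makes it possible to exercise it without starting the server. Check the field names against the payloads Rollout actually sends before relying on them.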

Current behavior

What runs today

Hosted endpoint runs are implemented: Rollout creates attempts, POSTs each task payload to your endpoint, and persists the output plus endpoint-returned score or passed values as evaluations.

The CLI and API accept --agent-command, but the backend does not execute it yet. Rollout also does not yet run heavy benchmark verifier harnesses.