List & pull
List the runnable benchmark packages, then add one to the active workspace:
rollout benchmarks list
rollout benchmarks pull terminal-bench

Run against an endpoint
Start a hosted run against your agent endpoint. Rollout sends each task payload to the URL and records the output plus any score or passed value the endpoint returns:
rollout benchmarks run terminal-bench \
  --agent-endpoint https://agent.example.com/rollout/eval \
  --header "Authorization: Bearer $AGENT_RUN_TOKEN" \
  --max-tasks 10

Current behavior
What runs today
Hosted endpoint runs are implemented: Rollout creates attempts, POSTs each task payload to your endpoint, and persists the output plus endpoint-returned score or passed values as evaluations.
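For orientation, here is a minimal sketch of what an agent endpoint on the receiving side could look like, using only the Python standard library. The request and response shapes are assumptions: this page does not specify the task payload schema, so the `instruction` key, the `output`, `score`, and `passed` response keys, the `handle_task` helper, and the local port are all hypothetical and should be replaced with whatever your Rollout deployment actually sends and expects. The sketch also does not validate the Authorization header.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_task(payload: dict) -> dict:
    """Run the agent on one task payload and build the response body.

    Assumption: the payload carries an "instruction" string, and the
    response uses "output", "score", and "passed" keys. These names are
    illustrative, not a documented Rollout schema.
    """
    instruction = str(payload.get("instruction", ""))
    output = f"echo: {instruction}"  # placeholder for a real agent call
    passed = bool(instruction)
    return {"output": output, "score": 1.0 if passed else 0.0, "passed": passed}


class EvalHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        # Read and decode the JSON body of the incoming task request.
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_error(400, "invalid JSON")
            return
        if not isinstance(payload, dict):
            self.send_error(400, "expected a JSON object")
            return

        # Reply with the agent output and the optional score/passed values.
        body = json.dumps(handle_task(payload)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Hypothetical local address; expose it at the URL passed to --agent-endpoint.
    HTTPServer(("127.0.0.1", 8080), EvalHandler).serve_forever()
```

Keeping the agent logic in a plain function, separate from the HTTP handler, makes it possible to unit-test the response shape without starting a server.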
--agent-command is accepted by the CLI and the API, but the backend does not execute it yet. Rollout also does not run heavy benchmark verifier harnesses yet.