Mental model
Benchmarks and traces are separate but connected. The benchmark runner creates a run and an attempt for each task. Your agent endpoint executes the task and returns an output plus an optional score. The SDK emits the observability trace that explains how that output was produced.
| Concept | Owned by | Use it for |
|---|---|---|
| Benchmark package | Rollout | Seeded task set, metadata, and verifier definitions. |
| Hosted benchmark run | Rollout | Fan out benchmark tasks to your HTTP endpoint. |
| Attempt | Rollout | One task execution. Its payload includes the trace id to reuse. |
| Trace | Your agent SDK | The full execution record for that attempt or production request. |
| Evaluation | Rollout | Score or pass/fail result returned by the endpoint. |
Note
The Python and TypeScript SDKs do not pull benchmark packages, execute hosted runs, or run heavyweight verifier harnesses. Use the UI or CLI for benchmark management, and instrument your agent with an SDK so each attempt has a readable trace.
Run benchmarks
Use the CLI to authenticate, list runnable packages, pull a package into the active workspace, and start a hosted run against an HTTP endpoint you control.
```shell
rollout login
rollout check

# See runnable benchmark packages.
rollout benchmarks list

# Add the package to the active workspace as a dataset plus verifiers.
rollout benchmarks pull terminal-bench

# Start a hosted run against your agent endpoint.
rollout benchmarks run terminal-bench \
  --agent-endpoint https://agent.example.com/rollout/eval \
  --header "Authorization: Bearer $AGENT_RUN_TOKEN" \
  --max-tasks 10 \
  --timeout-seconds 60
```

`rollout benchmarks pull` is idempotent. It adds the benchmark as a dataset and attaches its verifier definitions to the active workspace. `rollout benchmarks run` creates attempts and POSTs each task to `--agent-endpoint`.
| Flag | Meaning |
|---|---|
| `--agent-endpoint` | HTTP endpoint that will receive each benchmark task. |
| `--header` | Forwarded to the endpoint. Use it for endpoint auth tokens. |
| `--max-tasks` | Optional task cap for smoke tests or partial runs. |
| `--timeout-seconds` | Per-request timeout, defaulting to 60 seconds. |
Heads up
--agent-command exists in the CLI/API shape but is not executed by the backend yet. Hosted endpoint runs are the implemented path today.
Endpoint contract
Rollout POSTs one JSON payload per attempt. The important value for trace correlation is attempt.traceId. Reuse it as the SDK trace id so the benchmark attempt and the observability trace share the same identifier.
```json
{
  "benchmark": { "id": "terminal-bench", "name": "Terminal Bench" },
  "run": { "id": "run_123" },
  "attempt": { "id": "attempt_123", "traceId": "trace_123" },
  "task": {
    "instruction": "Complete the benchmark task.",
    "input": {},
    "metadata": {}
  },
  "trace": {
    "context": {
      "run_id": "run_123",
      "rollout_id": "attempt_123",
      "trace_id": "trace_123"
    }
  }
}
```

Return an object with `output` and, when your endpoint can score the attempt, one of `score`, `score: { score: number }`, or `passed`. Rollout stores numeric scores as evaluations for the benchmark run.
```json
{
  "output": "Final answer shown to the benchmark",
  "score": { "score": 0.92 },
  "passed": true
}
```

Instrument the endpoint
This endpoint uses the benchmark-provided trace id, records visible messages, wraps the solving step in a task span, captures the model call as an LLM span with usage, and emits a final signal for the score.
```python
from fastapi import FastAPI, Header, HTTPException
from mv37.rollout import Rollout, usage_from_openai
from openai import OpenAI

app = FastAPI()
client = Rollout(agent_name="benchmark_agent", environment="benchmark")
openai_client = OpenAI()


def score_answer(task: dict, output: str) -> float:
    return 1.0 if output.strip() else 0.0


@app.post("/rollout/eval")
async def run_benchmark_task(payload: dict, authorization: str | None = Header(default=None)):
    if authorization != "Bearer expected-agent-token":
        raise HTTPException(status_code=401, detail="unauthorized")

    attempt = payload.get("attempt", {})
    task = payload.get("task", {})
    trace_context = payload.get("trace", {}).get("context", {})

    with client.trace(
        "benchmark_agent",
        trace_id=attempt.get("traceId"),
        external_trace_id=attempt.get("id"),
        attributes={
            "benchmark_id": payload.get("benchmark", {}).get("id"),
            "run_id": payload.get("run", {}).get("id"),
        },
        context=trace_context,
    ) as trace:
        instruction = str(task.get("instruction") or "")
        trace.message(role="user", content=instruction)

        with trace.span("task", name="solve_benchmark_task") as task_span:
            task_span.record_input(task)
            with trace.llm("openai.responses", model="gpt-4.1-mini", provider="openai") as llm:
                llm.record_input({"instruction": instruction})
                response = openai_client.responses.create(
                    model="gpt-4.1-mini",
                    input=instruction,
                )
                llm.record_output(response)
                llm.set_usage(**usage_from_openai(response))
            output = response.output_text
            task_span.record_output({"output": output})

        score = score_answer(task, output)
        trace.message(role="assistant", content=output)
        trace.signal("benchmark.score", score)
        return {"output": output, "score": {"score": score}, "passed": score >= 0.8}
```

Tip
For a one-task smoke test, run the same command with --max-tasks 1, open the run in Rollout, then inspect the linked trace before starting a larger benchmark.
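Before pointing a hosted run at the endpoint, it can also help to check the contract offline. The sketch below is not part of the Rollout SDK: `trace_id_from_payload` and `summarize_result` are hypothetical helper names, and the fallback to `trace.context.trace_id` is an assumption based only on the sample payload, where both fields carry the same value. How Rollout itself treats a response that mixes several score forms is not specified here, so the helper only flattens what it finds.

```python
from __future__ import annotations

from typing import Any

# Sample payload copied from the endpoint contract above.
SAMPLE_PAYLOAD: dict[str, Any] = {
    "benchmark": {"id": "terminal-bench", "name": "Terminal Bench"},
    "run": {"id": "run_123"},
    "attempt": {"id": "attempt_123", "traceId": "trace_123"},
    "task": {"instruction": "Complete the benchmark task.", "input": {}, "metadata": {}},
    "trace": {
        "context": {"run_id": "run_123", "rollout_id": "attempt_123", "trace_id": "trace_123"}
    },
}


def trace_id_from_payload(payload: dict[str, Any]) -> str | None:
    """Return attempt.traceId, falling back to trace.context.trace_id (assumed equivalent)."""
    attempt = payload.get("attempt") or {}
    context = (payload.get("trace") or {}).get("context") or {}
    return attempt.get("traceId") or context.get("trace_id")


def summarize_result(result: dict[str, Any]) -> dict[str, Any]:
    """Check a response against the documented shape and flatten the score forms."""
    if "output" not in result:
        raise ValueError("response must include 'output'")

    raw_score = result.get("score")
    if isinstance(raw_score, dict):
        # Nested form: {"score": {"score": 0.92}}
        raw_score = raw_score.get("score")

    # Booleans are ints in Python, so exclude them from the numeric check.
    if isinstance(raw_score, bool) or not isinstance(raw_score, (int, float)):
        score = None
    else:
        score = float(raw_score)

    passed = result.get("passed")
    return {
        "output": result["output"],
        "score": score,
        "passed": passed if isinstance(passed, bool) else None,
    }
```

A local test can call the endpoint handler with `SAMPLE_PAYLOAD`, pass the returned dictionary to `summarize_result`, and confirm that the trace opened by the handler uses the id returned by `trace_id_from_payload`.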
Events to send
Prefer high-level SDK methods. They emit the canonical event types, fill timestamps and parent IDs, and keep the trace tree coherent.
| Event | High-level API | Send when |
|---|---|---|
| `trace.start` / `trace.end` | `client.trace(...)` | Every user request, background job, conversation turn, or benchmark attempt. |
| `message` | `trace.message(...)` | User, assistant, system, and tool-visible messages. Mark hidden routing messages internal. |
| `span.start` / `span.update` / `span.end` | `trace.span(...)` | Planner, router, parser, evaluator, retrieval, and custom task steps. |
| `llm.stream_start` / `llm.stream_chunk` / `llm.stream_end` | `trace.llm(..., stream=True)` | Streaming model calls. Persist chunks only when you need token-level replay. |
| `tool.call` / `tool.result` | `trace.tool(...)` | Every external tool, function call, browser action, DB query, or API call the agent makes. |
| `feedback` | `trace.feedback(...)` | Explicit user feedback such as thumbs up, rating, correction, or CSAT. |
| `user.signal` | `trace.signal(...)` | Behavioral or business outcomes such as resolved, escalated, purchased, score, or passed. |
| `identity.update` | `client.identify_user(...)` | Stable user/account traits for segmentation and filtering. |
| `metric` / `error` / `eval.result` | `client.capture_event(...)` | Low-level custom events when no high-level SDK method fits. |
```python
with client.trace("support_agent", user_id="cus_123", conversation_id="thread_123") as trace:
    trace.message(role="user", content=user_message)

    with trace.span("retrieval", name="search_help_center") as span:
        span.record_input({"query": user_message})
        docs = search(user_message)
        span.record_output({"document_ids": [doc.id for doc in docs], "hit_count": len(docs)})

    with trace.tool("lookup_order", tool_call_id="call_123", arguments={"order_id": "4421"}) as call:
        result = lookup_order("4421")
        call.record_output({"status": result.status})

    trace.feedback("thumbs_up", True)
    trace.signal("order_resolved", True)
```

Trace quality checklist
A useful trace lets a teammate or browser-capable agent reconstruct what happened without local code access. Optimize for stable identifiers, clear span names, bounded payloads, and explicit outcomes.
- Open one trace for the whole unit of work: a request, conversation turn, queue job, or benchmark attempt.
- Set stable IDs: `user_id`, `session_id`, `conversation_id`, `external_trace_id`, and the benchmark-provided `trace_id` when you have one.
- Name spans by purpose, not implementation detail: `route_request`, `search_docs`, `call_refund_api`, `judge_answer`.
- For LLM spans, always include `provider`, `model`, input summary, output summary, token usage, and cost when available.
- For tools, record sanitized arguments, `tool_call_id`, retry count, result shape, latency, and errors. Avoid dumping secrets or large raw responses.
- For retrieval, record the query, filters, document IDs, hit count, scores, and source collection. Avoid storing whole documents unless they are already safe to retain.
- End with an outcome: `trace.feedback` for explicit user reactions and `trace.signal` for product or benchmark results.
- Use `scrubber` and `before_send` to redact or drop sensitive fields before events leave the process.
- Call `check()` during setup and `flush()` or `shutdown()` from short-lived processes and serverless handlers.
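The advice on sanitized arguments and bounded payloads can be illustrated with a small standalone function. This is a minimal sketch, not the SDK's own scrubber: `scrub_payload`, `SENSITIVE_KEYS`, and `MAX_STRING_LENGTH` are hypothetical names, the key list and length cap are arbitrary choices, and the way such a function is registered as `scrubber` or `before_send` is not shown because those signatures are not given here.

```python
from typing import Any

# Hypothetical defaults; adjust both to your own data and retention policy.
SENSITIVE_KEYS = {"authorization", "api_key", "password", "token", "secret"}
MAX_STRING_LENGTH = 2000
TRUNCATION_MARKER = "...[truncated]"


def scrub_payload(value: Any) -> Any:
    """Return a copy of value with sensitive keys redacted and long strings capped."""
    if isinstance(value, dict):
        return {
            key: "[redacted]" if str(key).lower() in SENSITIVE_KEYS else scrub_payload(item)
            for key, item in value.items()
        }
    if isinstance(value, (list, tuple)):
        # Tuples are returned as lists, which matches JSON serialization.
        return [scrub_payload(item) for item in value]
    if isinstance(value, str) and len(value) > MAX_STRING_LENGTH:
        return value[:MAX_STRING_LENGTH] + TRUNCATION_MARKER
    return value
```

One way to use it is to pass tool arguments and results through `scrub_payload` before handing them to `record_input` or `record_output`, so the redaction happens regardless of how the SDK hooks are configured.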
TypeScript agents
The TypeScript SDK uses the same event taxonomy. The core API is callback-based for traces and spans, and generic provider wrap() is not implemented yet. Use manual spans for direct provider SDK calls, or wrapAISDK for Vercel AI SDK apps.
```typescript
import { Rollout } from "@mv37/rollout";

const rollout = new Rollout({
  apiKey: process.env.ROLLOUT_API_KEY,
  agentName: "support_agent",
  environment: "production",
});

await rollout.trace("support_agent", async (trace) => {
  trace.message({ role: "user", content: "Where is my order?" });

  await trace.span(
    "llm",
    async (span) => {
      span.recordInput({ messages: [{ role: "user", content: "Where is my order?" }] });
      span.recordOutput({ content: "Your order has shipped." });
      span.setUsage({ input_tokens: 120, output_tokens: 24, total_tokens: 144 });
    },
    { name: "model.call", model: "gpt-4.1-mini", provider: "openai" },
  );

  trace.signal("order_resolved", true);
});

await rollout.shutdown();
```

```typescript
import * as ai from "ai";
import { Rollout } from "@mv37/rollout";
import { eventMetadata, wrapAISDK } from "@mv37/rollout/ai-sdk";

const rollout = new Rollout({ apiKey: process.env.ROLLOUT_API_KEY });
const wrappedAI = wrapAISDK(ai, { client: rollout });

const result = await wrappedAI.generateText({
  model,
  prompt: "Help the user",
  experimental_telemetry: {
    metadata: eventMetadata({
      agentName: "support_agent",
      userId: "user_123",
      conversationId: "chat_123",
      externalTraceId: "message_123",
    }),
  },
});
```

Agent-readable docs
These docs are ordinary linked HTML pages with semantic headings, tables, and visible code text. For crawler-style agents, the most important entry points are:
| URL | Purpose |
|---|---|
| /docs/benchmarks | Benchmark runbook, endpoint contract, and trace-quality checklist. |
| /docs/tracing | Core trace, message, span, LLM, and nesting API. |
| /docs/api-reference | Complete event taxonomy and public API surface. |
| /docs/privacy | Scrubbing, before_send, redaction, and sampling guidance. |