Benchmarks & trace quality

Mental model

Benchmarks and traces are separate but connected. The benchmark runner creates a run and an attempt for each task. Your agent endpoint executes the task and returns an output plus an optional score. The SDK emits the observability trace that explains how that output was produced.

Concept	What owns it	Use it for
Benchmark package	Rollout	Seeded task set, metadata, and verifier definitions.
Hosted benchmark run	Rollout	Fan out benchmark tasks to your HTTP endpoint.
Attempt	Rollout	One task execution. Its payload includes the trace id to reuse.
Trace	Your agent SDK	The full execution record for that attempt or production request.
Evaluation	Rollout	Score or pass/fail result returned by the endpoint.

Note

The Python and TypeScript SDKs do not pull benchmark packages, execute hosted runs, or run heavyweight verifier harnesses. Use the UI or CLI for benchmark management, and instrument your agent with an SDK so each attempt has a readable trace.

Run benchmarks

Use the CLI to authenticate, list runnable packages, pull a package into the active workspace, and start a hosted run against an HTTP endpoint you control.

shell

rollout login
rollout check

# See runnable benchmark packages.
rollout benchmarks list

# Add the package to the active workspace as a dataset plus verifiers.
rollout benchmarks pull terminal-bench

# Start a hosted run against your agent endpoint.
rollout benchmarks run terminal-bench \
  --agent-endpoint https://agent.example.com/rollout/eval \
  --header "Authorization: Bearer $AGENT_RUN_TOKEN" \
  --max-tasks 10 \
  --timeout-seconds 60

rollout benchmarks pull is idempotent. It adds the benchmark as a dataset and attaches its verifier definitions to the active workspace. rollout benchmarks run creates attempts and POSTs each task to --agent-endpoint.

Flag	Meaning
`--agent-endpoint`	HTTP endpoint that will receive each benchmark task.
`--header`	Forwarded to the endpoint. Use it for endpoint auth tokens.
`--max-tasks`	Optional task cap for smoke tests or partial runs.
`--timeout-seconds`	Per-request timeout, defaulting to 60 seconds.

Heads up

The CLI and API accept --agent-command, but the backend does not execute it yet. Hosted endpoint runs are the only implemented path today.

Endpoint contract

Rollout POSTs one JSON payload per attempt. The important value for trace correlation is attempt.traceId. Reuse it as the SDK trace id so the benchmark attempt and the observability trace share the same identifier.

request.json

{
  "benchmark": { "id": "terminal-bench", "name": "Terminal Bench" },
  "run": { "id": "run_123" },
  "attempt": { "id": "attempt_123", "traceId": "trace_123" },
  "task": {
    "instruction": "Complete the benchmark task.",
    "input": {},
    "metadata": {}
  },
  "trace": {
    "context": {
      "run_id": "run_123",
      "rollout_id": "attempt_123",
      "trace_id": "trace_123"
    }
  }
}

Return a JSON object with output and, when your endpoint can score the attempt, a result in one of these forms: a numeric score, a nested score: { score: number }, or a boolean passed. Rollout stores numeric scores as evaluations for the benchmark run.

response.json

{
  "output": "Final answer shown to the benchmark",
  "score": { "score": 0.92 },
  "passed": true
}

Instrument the endpoint

This endpoint uses the benchmark-provided trace id, records visible messages, wraps the solving step in a task span, captures the model call as an LLM span with usage, and emits a final signal for the score.

benchmark_endpoint.py

from fastapi import FastAPI, Header, HTTPException
from mv37.rollout import Rollout, usage_from_openai
from openai import OpenAI

app = FastAPI()
client = Rollout(agent_name="benchmark_agent", environment="benchmark")
openai_client = OpenAI()


def score_answer(task: dict, output: str) -> float:
    return 1.0 if output.strip() else 0.0


@app.post("/rollout/eval")
async def run_benchmark_task(payload: dict, authorization: str | None = Header(default=None)):
    if authorization != "Bearer expected-agent-token":
        raise HTTPException(status_code=401, detail="unauthorized")

    attempt = payload.get("attempt", {})
    task = payload.get("task", {})
    trace_context = payload.get("trace", {}).get("context", {})

    with client.trace(
        "benchmark_agent",
        trace_id=attempt.get("traceId"),
        external_trace_id=attempt.get("id"),
        attributes={
            "benchmark_id": payload.get("benchmark", {}).get("id"),
            "run_id": payload.get("run", {}).get("id"),
        },
        context=trace_context,
    ) as trace:
        instruction = str(task.get("instruction") or "")
        trace.message(role="user", content=instruction)

        with trace.span("task", name="solve_benchmark_task") as task_span:
            task_span.record_input(task)
            with trace.llm("openai.responses", model="gpt-4.1-mini", provider="openai") as llm:
                llm.record_input({"instruction": instruction})
                response = openai_client.responses.create(
                    model="gpt-4.1-mini",
                    input=instruction,
                )
                llm.record_output(response)
                llm.set_usage(**usage_from_openai(response))
            output = response.output_text
            task_span.record_output({"output": output})

        score = score_answer(task, output)
        trace.message(role="assistant", content=output)
        trace.signal("benchmark.score", score)

    return {"output": output, "score": {"score": score}, "passed": score >= 0.8}

Tip

For a one-task smoke test, run the same command with --max-tasks 1, open the run in Rollout, then inspect the linked trace before starting a larger benchmark.
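Before a one-task hosted run, it can also help to POST the sample request payload to the endpoint yourself. The sketch below uses only the Python standard library. The helper names are hypothetical, the payload is copied from request.json, and the check covers only the `output` field from the endpoint contract.

```python
import json
import urllib.request

# Sample payload copied from request.json.
SAMPLE_PAYLOAD = {
    "benchmark": {"id": "terminal-bench", "name": "Terminal Bench"},
    "run": {"id": "run_123"},
    "attempt": {"id": "attempt_123", "traceId": "trace_123"},
    "task": {"instruction": "Complete the benchmark task.", "input": {}, "metadata": {}},
    "trace": {"context": {"run_id": "run_123", "rollout_id": "attempt_123", "trace_id": "trace_123"}},
}


def build_smoke_request(endpoint: str, token: str) -> urllib.request.Request:
    """Build a POST request carrying the sample payload and a bearer token."""
    return urllib.request.Request(
        endpoint,
        data=json.dumps(SAMPLE_PAYLOAD).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


def check_reply(raw: bytes) -> dict:
    """Parse the endpoint reply and verify that it carries an `output` field."""
    reply = json.loads(raw)
    if not isinstance(reply, dict) or "output" not in reply:
        raise ValueError("endpoint reply must be a JSON object with an 'output' field")
    return reply


def run_smoke_test(endpoint: str, token: str, timeout_seconds: float = 60) -> dict:
    """Send the sample payload and return the validated reply."""
    request = build_smoke_request(endpoint, token)
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return check_reply(response.read())
```

For example, with the endpoint running locally (the host and port here are assumptions), `run_smoke_test("http://localhost:8000/rollout/eval", "expected-agent-token")` returns the parsed reply or raises if the endpoint rejects the request or omits `output`.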

Events to send

Prefer high-level SDK methods. They emit the canonical event types, fill timestamps and parent IDs, and keep the trace tree coherent.

Event	High-level API	Send when
`trace.start / trace.end`	`client.trace(...)`	Every user request, background job, conversation turn, or benchmark attempt.
`message`	`trace.message(...)`	User, assistant, system, and tool-visible messages. Mark hidden routing messages internal.
`span.start / span.update / span.end`	`trace.span(...)`	Planner, router, parser, evaluator, retrieval, and custom task steps.
`llm.stream_start / llm.stream_chunk / llm.stream_end`	`trace.llm(..., stream=True)`	Streaming model calls. Persist chunks only when you need token-level replay.
`tool.call / tool.result`	`trace.tool(...)`	Every external tool, function call, browser action, DB query, or API call the agent makes.
`feedback`	`trace.feedback(...)`	Explicit user feedback such as thumbs up, rating, correction, or CSAT.
`user.signal`	`trace.signal(...)`	Behavioral or business outcomes such as resolved, escalated, purchased, score, or passed.
`identity.update`	`client.identify_user(...)`	Stable user/account traits for segmentation and filtering.
`metric / error / eval.result`	`client.capture_event(...)`	Low-level custom events when no high-level SDK method fits.

trace_events.py

with client.trace("support_agent", user_id="cus_123", conversation_id="thread_123") as trace:
    trace.message(role="user", content=user_message)

    with trace.span("retrieval", name="search_help_center") as span:
        span.record_input({"query": user_message})
        docs = search(user_message)
        span.record_output({"document_ids": [doc.id for doc in docs], "hit_count": len(docs)})

    with trace.tool("lookup_order", tool_call_id="call_123", arguments={"order_id": "4421"}) as call:
        result = lookup_order("4421")
        call.record_output({"status": result.status})

    trace.feedback("thumbs_up", True)
    trace.signal("order_resolved", True)

Trace quality checklist

A useful trace lets a teammate or browser-capable agent reconstruct what happened without local code access. Optimize for stable identifiers, clear span names, bounded payloads, and explicit outcomes.

Open one trace for the whole unit of work: a request, conversation turn, queue job, or benchmark attempt.
Set stable IDs: user_id, session_id, conversation_id, external_trace_id, and the benchmark-provided trace_id when you have one.
Name spans by purpose, not implementation detail: route_request, search_docs, call_refund_api, judge_answer.
For LLM spans, always include provider, model, input summary, output summary, token usage, and cost when available.
For tools, record sanitized arguments, tool_call_id, retry count, result shape, latency, and errors. Avoid dumping secrets or large raw responses.
For retrieval, record the query, filters, document IDs, hit count, scores, and source collection. Avoid storing whole documents unless they are already safe to retain.
End with an outcome: trace.feedback for explicit user reactions and trace.signal for product or benchmark results.
Use scrubber and before_send to redact or drop sensitive fields before events leave the process.
Call check() during setup and flush() or shutdown() from short-lived processes and serverless handlers.

TypeScript agents

The TypeScript SDK uses the same event taxonomy. The core API is callback-based for traces and spans, and generic provider wrap() is not implemented yet. Use manual spans for direct provider SDK calls, or wrapAISDK for Vercel AI SDK apps.

agent.ts

import { Rollout } from "@mv37/rollout";

const rollout = new Rollout({
  apiKey: process.env.ROLLOUT_API_KEY,
  agentName: "support_agent",
  environment: "production",
});

await rollout.trace("support_agent", async (trace) => {
  trace.message({ role: "user", content: "Where is my order?" });

  await trace.span(
    "llm",
    async (span) => {
      span.recordInput({ messages: [{ role: "user", content: "Where is my order?" }] });
      span.recordOutput({ content: "Your order has shipped." });
      span.setUsage({ input_tokens: 120, output_tokens: 24, total_tokens: 144 });
    },
    { name: "model.call", model: "gpt-4.1-mini", provider: "openai" },
  );

  trace.signal("order_resolved", true);
});

await rollout.shutdown();

ai-sdk.ts

import * as ai from "ai";
import { Rollout } from "@mv37/rollout";
import { eventMetadata, wrapAISDK } from "@mv37/rollout/ai-sdk";

const rollout = new Rollout({ apiKey: process.env.ROLLOUT_API_KEY });
const wrappedAI = wrapAISDK(ai, { client: rollout });

const result = await wrappedAI.generateText({
  model,
  prompt: "Help the user",
  experimental_telemetry: {
    metadata: eventMetadata({
      agentName: "support_agent",
      userId: "user_123",
      conversationId: "chat_123",
      externalTraceId: "message_123",
    }),
  },
});

Agent-readable docs

These docs are ordinary linked HTML pages with semantic headings, tables, and visible code text. For crawler-style agents, the most important entry points are:

URL	Purpose
/docs/benchmarks	Benchmark runbook, endpoint contract, and trace-quality checklist.
/docs/tracing	Core trace, message, span, LLM, and nesting API.
/docs/api-reference	Complete event taxonomy and public API surface.
/docs/privacy	Scrubbing, before_send, redaction, and sampling guidance.