Agent Fleet Observability: Real-Time Telemetry and Dashboard Patterns for Multi-Agent Systems

Executive Summary

The LLM observability market reached $1.97 billion in 2025 and is projected to hit $6.8 billion by 2029 at a 36.5% CAGR — growth driven almost entirely by the shift from single-agent to fleet-scale deployments. Yet 89% of organizations that have implemented some form of agent observability still cite quality failures as their primary production barrier, not infrastructure issues. The problem is not that teams are ignoring observability; it is that they are importing individual-agent monitoring patterns into fleet contexts where they do not scale.

Fleet observability is categorically different from single-agent observability. When you operate one agent, you can read a trace, notice the stuck tool call, and fix it. When you operate fifty concurrent agents, each with its own model, context window, tool set, and task queue, manual inspection collapses. Fleet observability requires aggregate metrics that surface fleet-wide anomalies before they become failures, real-time streaming that delivers signal without melting your browser, and a cost attribution model granular enough to answer "which task type is burning our budget" rather than just "we spent $X this month." Critically, the orchestrator-worker pattern accounts for roughly 70% of production multi-agent deployments, yet most observability tooling was designed around the single-agent request-response cycle.

This article addresses the architectural gap. We cover how to instrument heterogeneous agent fleets using OpenTelemetry GenAI semantic conventions — the standard that emerged from the January 2025 consolidation of OpenLLMetry into the OTel core specification. We examine dashboard design patterns that scale from ten agents to thousands, the transport protocol tradeoffs between SSE, WebSocket, and gRPC for live data feeds, the key metrics that separate production-grade from best-effort monitoring, and the fleet commander pattern where a meta-agent consumes its own fleet's telemetry to make routing and throttling decisions. The goal is a reference architecture that production teams can implement today.

The Fleet Observability Problem Space

Before examining solutions, it is worth being precise about what fleet observability needs to solve that single-agent observability does not.

Scale of concurrent activity. An individual agent trace is a sequential or mildly parallel tree of spans. A fleet trace is a forest — hundreds of simultaneous trees, each progressing at different speeds, some stuck, some completing, some spawning sub-agents. The first challenge is rendering this forest legibly. A waterfall view designed for a ten-span trace becomes unusable at ten thousand spans.

Heterogeneous runtimes and models. Production fleets rarely use a single model or a single runtime. You might have Claude Sonnet 4 agents handling code review, Haiku agents handling summarization, a GPT-4o agent for legacy compatibility, and a fine-tuned model for domain-specific extraction. Each provider emits telemetry in different shapes. Each runtime has different context window limits, different cost curves, and different failure modes. Fleet observability must normalize across this heterogeneity before it can aggregate.

Cost attribution across task hierarchies. A user request processed by an orchestrator that fans out to four workers produces costs in all four workers plus the orchestrator itself. Attributing the total cost to the originating user, feature, or team requires tracing the entire causal chain — which only works if the orchestrator propagates trace context into every downstream dispatch.

Fleet-wide failure modes. Many failure modes only become visible at the fleet level. A stuck agent is a single-agent problem. A 40% increase in stuck agents across the fleet within a five-minute window is a fleet-level anomaly that could indicate a tool API degradation, a prompt regression after a deployment, or a context window misconfiguration. These patterns require aggregate metrics with time-series baselines, not individual trace inspection.

Operational control feedback loops. Fleet observability should not be purely read-only. The data should drive automated responses: throttle an agent burning tokens at 10x its baseline, reroute tasks from a degraded worker pool to a healthy one, pause a runaway agent before it exhausts its cost budget. This feedback loop from telemetry to action is what separates passive monitoring from active fleet management.

OpenTelemetry GenAI Semantic Conventions: The Standard Layer

Any production fleet observability stack should be built on the OpenTelemetry GenAI semantic conventions, not on proprietary schemas. January 2025 was the inflection point: OpenLLMetry's conventions were formally merged into the OTel specification, ending the fragmentation that had forced teams to choose between incompatible schemas. OTel v1.39 added MCP (Model Context Protocol) semantic conventions, addressing the observability gap created by MCP's rapid 2025 adoption.

Core Span Attributes

Every LLM call in a fleet should emit spans with these standard attributes:

Request-time attributes:

gen_ai.system: provider identifier, for example openai, anthropic, aws.bedrock, gcp.vertex_ai (newer semantic-convention releases rename this attribute to gen_ai.provider.name)
gen_ai.request.model — the model name as requested: claude-sonnet-4-5, gpt-4o, gemini-2.0-flash
gen_ai.request.max_tokens — the configured token limit
gen_ai.request.temperature — sampling temperature
gen_ai.system_instructions — system prompt content (when content recording is enabled and PII policy permits)

Response-time attributes:

gen_ai.response.model — the actual model version that served the request (may differ from requested model when using API-level routing)
gen_ai.response.id — response identifier for correlation
gen_ai.response.finish_reasons — array: stop, length, tool_calls, content_filter

Usage attributes — the foundation of cost attribution:

gen_ai.usage.input_tokens — prompt token count
gen_ai.usage.output_tokens — completion token count
Cache token attributes (provider-normalized, added in 2025) for prompt caching cost calculations

Agent-specific:

gen_ai.agent.version — version identifier for the agent definition (recently added to the spec)

Metrics:

gen_ai.client.token.usage — histogram of token usage per call
gen_ai.client.operation.duration — histogram of LLM operation duration

To opt into the latest experimental conventions: OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental.
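
As an illustration, the attributes listed above can be assembled into one flat dictionary before being attached to a span. This is a minimal sketch: the helper name llm_span_attributes and its parameters are hypothetical, and only the gen_ai.* keys come from the list above. It deliberately avoids any SDK call so that it runs with the standard library alone.

```python
from typing import Any, Dict, List, Optional


def llm_span_attributes(
    system: str,
    request_model: str,
    response_model: str,
    input_tokens: int,
    output_tokens: int,
    finish_reasons: List[str],
    max_tokens: Optional[int] = None,
    temperature: Optional[float] = None,
) -> Dict[str, Any]:
    """Build a flat attribute dict keyed by the gen_ai.* names listed above."""
    attrs: Dict[str, Any] = {
        "gen_ai.system": system,
        "gen_ai.request.model": request_model,
        "gen_ai.response.model": response_model,
        "gen_ai.response.finish_reasons": list(finish_reasons),
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
    # Optional request parameters are left out rather than recorded as None.
    if max_tokens is not None:
        attrs["gen_ai.request.max_tokens"] = max_tokens
    if temperature is not None:
        attrs["gen_ai.request.temperature"] = temperature
    return attrs


attrs = llm_span_attributes(
    system="anthropic",
    request_model="claude-sonnet-4-5",
    response_model="claude-sonnet-4-5",
    input_tokens=1200,
    output_tokens=340,
    finish_reasons=["stop"],
    max_tokens=4096,
)
```

Keeping the keys in one helper means every agent runtime in a heterogeneous fleet emits the same attribute names, which is the precondition for aggregating across providers.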

Agentic Framework Conventions

The core span/metric attributes cover individual LLM calls. For full fleet observability, you need the emerging agentic system conventions tracked in OTel semantic-conventions issue #2664. The proposed model covers:

Tasks — the minimal trackable work unit, decomposable into subtasks
Actions — tool calls, LLM queries, API requests, vector database queries, human input, sub-workflow invocations
Agents — individual agent identity and version
Teams — groups of coordinating agents (a fleet is a team)
Artifacts — inputs and outputs of agent tasks
Memory — agent memory state observations

This vocabulary maps cleanly onto the orchestrator-worker pattern: a fleet task decomposes into worker subtasks; each worker executes actions; the fleet commander coordinates the team.

Arize Phoenix's Span Kind Taxonomy

The most complete fleet-ready span kind taxonomy in production today is from Arize Phoenix (open-source, Apache 2.0): ten distinct span kinds, including CHAIN, LLM, TOOL, RETRIEVER, EMBEDDING, AGENT, RERANKER, GUARDRAIL, and EVALUATOR. The AGENT kind specifically models fleet nodes; the EVALUATOR kind models in-line quality evaluation; GUARDRAIL models policy enforcement spans. Phoenix's OpenInference instrumentation largely adopted OTel gen_ai conventions, and the two specs converged in 2025. Phoenix supports all major frameworks (OpenAI Agents SDK, Claude Agent SDK, LangGraph, Vercel AI SDK, Mastra, CrewAI, LlamaIndex, DSPy) and all major providers (OpenAI, Anthropic, Google ADK, AWS Bedrock, OpenRouter, LiteLLM).

Real-Time Transport Architecture: SSE vs WebSocket vs gRPC

The choice of transport for live dashboard data is not primarily about features — SSE, WebSocket, and gRPC streaming all deliver real-time data. The choice is about operational complexity and failure modes at fleet scale.

The Decision Matrix by Layer

Layer	Protocol	Rationale
LLM API → Application	SSE	Provider-standard; token streaming is universally implemented as SSE
Application → Browser Dashboard	SSE	Lightweight; stateless; no WebSocket overhead for read-only feeds
Interactive Control Plane	WebSocket	Bidirectional: operator sends commands, agent sends telemetry
Agent → OTel Collector	gRPC (OTLP)	High-throughput binary transport; schema enforcement via Protocol Buffers
Collector → Backend Store	gRPC (OTLP) or HTTP/Protobuf	Standard OTLP export paths

Server-Sent Events: The Dominant Pattern

SSE is the de facto standard for LLM streaming. OpenAI, Anthropic, Google Gemini, Mistral, and virtually every inference provider implement token streaming via SSE. The protocol is HTTP/1.1 compatible, requires no upgrade handshake, auto-reconnects without client code, and uses the browser's native EventSource API.

For fleet dashboards, the properties that make SSE attractive are the same ones that make LLM streaming work: stateless server-side, horizontally scalable without sticky sessions or socket brokers, and trivially proxied through standard HTTP infrastructure. A dashboard reading fleet-wide metrics via SSE can run behind any load balancer without special routing rules.

The concrete SSE stream format for a fleet metrics feed:

event: fleet_snapshot
data: {"active_agents": 47, "queue_depth": 12, "tokens_per_sec": 8420, "cost_usd_per_hour": 3.14}

event: agent_update
data: {"agent_id": "worker-42", "state": "active", "context_utilization": 0.73, "task_id": "t-881"}

event: alert
data: {"severity": "warning", "metric": "context_utilization", "agent_id": "worker-17", "value": 0.91}
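
The wire format above is simple enough to produce and consume with a few lines of code. The following sketch shows one way to serialize and parse such events; the function names format_sse and parse_sse are hypothetical, and the parser assumes each event carries a single data line, as in the feed shown above.

```python
import json
from typing import Dict, List, Tuple


def format_sse(event: str, data: Dict) -> str:
    """Serialize one event in the text/event-stream wire format."""
    # An event is an "event:" line, a "data:" line, then a blank line.
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def parse_sse(stream: str) -> List[Tuple[str, Dict]]:
    """Parse single-data-line events back into (event, data) pairs."""
    events = []
    for block in stream.split("\n\n"):
        fields = {}
        for line in block.splitlines():
            name, _, value = line.partition(":")
            # The format allows one optional space after the colon.
            fields[name] = value[1:] if value.startswith(" ") else value
        if "data" in fields:
            # "message" is the default event type when no event line is sent.
            events.append((fields.get("event", "message"), json.loads(fields["data"])))
    return events


stream = format_sse("fleet_snapshot", {"active_agents": 47, "queue_depth": 12})
stream += format_sse("alert", {"severity": "warning", "agent_id": "worker-17", "value": 0.91})
parsed = parse_sse(stream)
```

In a browser the parsing half is handled by EventSource; the Python parser here is mainly useful for tests and for non-browser consumers such as a fleet commander process.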

WebSocket: When You Need Bidirectional Control

WebSocket becomes necessary when the dashboard operator needs to send commands back to the fleet — pause an agent, drain a queue, change a routing rule, or trigger an emergency cost cap. The control plane of a mature fleet management system is WebSocket; the telemetry feed can remain SSE.

The operational cost of WebSocket at scale is real: long-lived stateful connections require sticky session routing at the load balancer, connection pooling for memory management, broker infrastructure (Redis Pub/Sub, NATS) to fan out events to connected clients, and reconnection logic with exponential backoff. None of these are unsolvable, but they are engineering work that SSE-only architectures avoid.

gRPC/OTLP: The Agent-to-Collector Pipe

Between agents and the OpenTelemetry Collector, gRPC with OTLP (OpenTelemetry Protocol) is the correct choice. Binary encoding with Protocol Buffers achieves higher throughput than JSON-over-HTTP for high-volume telemetry. OTLP is push-based: agents batch spans and metrics and send them to the collector as unary Export requests over a long-lived HTTP/2 connection, and the collector acknowledges each batch. Schema enforcement via .proto definitions prevents malformed telemetry from polluting the store. The OTLP/gRPC path also supports compression (gzip, zstd), which matters when exporting large context content for debugging.

gRPC is not browser-native, so it is limited to the agent-to-collector and collector-to-backend segments. Browser dashboards should use SSE or WebSocket, not gRPC-Web, unless the complexity tradeoff is explicitly justified.

Key Metrics: What to Collect at Fleet Scale

The following metric taxonomy is what production systems actually need to answer operational questions. Not all metrics need to be collected at the same granularity, so the descriptions below include recommended resolution and alert thresholds where industry practice has converged.

Context and Token Metrics

Context window utilization is the single most operationally important metric for running healthy agents. The formula: (prompt_tokens + context_tokens) / model_context_limit × 100. At 80% utilization, agents should trigger a yellow alert — they are at risk of context truncation, which silently degrades task quality without producing an error. At 95%, they should trigger a red alert or automated compaction. This 80% threshold has become the informal industry standard across Langfuse, Helicone, and Datadog documentation.

Different models have different limits to normalize against: Claude Sonnet 4.5 at 200,000 tokens, GPT-4o at 128,000 tokens, Gemini 2.0 Flash at 1,000,000 tokens. Fleet dashboards must store the context limit per model, not a fleet-wide constant.
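
The utilization formula and the per-model limits can be combined in a small helper. This is a sketch under stated assumptions: the names MODEL_CONTEXT_LIMITS, context_utilization, and alert_level are hypothetical, the limits are the three quoted above, and treating the 80% and 95% thresholds as inclusive is an assumption.

```python
# Context limits per model, taken from the figures quoted above.
MODEL_CONTEXT_LIMITS = {
    "claude-sonnet-4-5": 200_000,
    "gpt-4o": 128_000,
    "gemini-2.0-flash": 1_000_000,
}


def context_utilization(model: str, prompt_tokens: int, context_tokens: int) -> float:
    """Return utilization as a percentage of the model's own context limit."""
    limit = MODEL_CONTEXT_LIMITS[model]  # raises KeyError for unknown models
    return 100 * (prompt_tokens + context_tokens) / limit


def alert_level(utilization_pct: float) -> str:
    """Map a utilization percentage to the yellow/red levels described above."""
    if utilization_pct >= 95:
        return "red"
    if utilization_pct >= 80:
        return "yellow"
    return "ok"


gpt_pct = context_utilization("gpt-4o", 100_000, 2_400)
sonnet_pct = context_utilization("claude-sonnet-4-5", 150_000, 40_000)
```

The same 102,400 tokens that put a GPT-4o agent at the yellow threshold would register as roughly 10% on Gemini 2.0 Flash, which is why a fleet-wide constant gives misleading alerts.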

Token throughput (tokens per second, output only) is a leading indicator of model performance degradation. A sudden drop in tokens/sec across multiple agents simultaneously — when request volume has not changed — suggests provider-side capacity issues or API latency degradation. Track at 30-second resolution; alert on 30% drop from 5-minute baseline.

Token consumption rate measures how fast a running agent is consuming its cost budget. An agent consuming 10,000 tokens per minute on a task that historically averages 2,000 tokens per minute (5× its baseline) is likely in a reasoning loop and should be flagged for inspection or automated termination.

Cost Metrics

Cost per task is the primary unit of fleet economics. Calculated as: (input_tokens × model_input_rate_per_1M + output_tokens × model_output_rate_per_1M) / 1_000_000. Current pricing reference points (June 2026): Claude Sonnet 4.5 at $3.00/$15.00 per million input/output tokens; GPT-4o at $2.50/$10.00; Gemini 2.0 Flash at $0.075/$0.30. Cost per task enables the fleet commander to make cost-aware routing decisions — routing to a cheaper model when task complexity does not require a frontier model.
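
As a worked instance of the formula above, the following sketch computes cost per task from token counts. The table name RATES_PER_1M and the function name cost_per_task are hypothetical; the rates are the reference points quoted above and should be treated as placeholders to be replaced with current provider pricing.

```python
# (input rate, output rate) in USD per million tokens, as quoted above.
RATES_PER_1M = {
    "claude-sonnet-4-5": (3.00, 15.00),
    "gpt-4o": (2.50, 10.00),
    "gemini-2.0-flash": (0.075, 0.30),
}


def cost_per_task(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one task from its token counts."""
    input_rate, output_rate = RATES_PER_1M[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# 100,000 input tokens and 10,000 output tokens on each model.
sonnet_cost = cost_per_task("claude-sonnet-4-5", 100_000, 10_000)
flash_cost = cost_per_task("gemini-2.0-flash", 100_000, 10_000)
```

For this token profile the Sonnet cost works out to $0.45 and the Flash cost to about $0.0105, a gap of roughly 40×, which is the kind of difference the fleet commander weighs when routing by complexity.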

Cost burn rate (USD/hour) is the fleet-wide aggregate. This is the metric that finance and operations teams actually care about. Dashboard presentation: current rate as a number, 24-hour trend as a sparkline, monthly projection extrapolated from current rate. Alert threshold: 3× the rolling 1-hour average, which catches runaway cost events without alerting on normal variance.

Cost per agent enables attribution of budget consumption to specific agent types. An agent consuming 40% of fleet budget while processing 10% of tasks is either working on genuinely expensive tasks (frontier model, large context) or malfunctioning.

Cost attribution by feature and user requires custom metadata. Helicone implements this via Helicone-Property-* headers: Helicone-Property-feature: code-review, Helicone-Property-user-tier: enterprise. Langfuse uses metadata fields on trace objects. The important design principle: cost attribution metadata must be propagated through the entire orchestrator-worker call chain, not just set at the entry point.

Queue and Throughput Metrics

Queue depth is the number of tasks waiting for agent assignment. Combined with agent utilization (active agents / provisioned agents), it determines whether the fleet is under-provisioned, appropriately loaded, or over-provisioned. A queue depth of zero with utilization under 30% suggests over-provisioning. Queue depth growing faster than tasks are completing suggests a throughput bottleneck that needs investigation before it becomes a user-facing latency problem.
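
The provisioning heuristics in this paragraph can be expressed as a small classifier. This is a sketch only: the function name provisioning_signal, its parameters, and the returned labels are hypothetical, and comparing arrival rate with completion rate is one assumed way to detect a queue that grows faster than tasks complete.

```python
def provisioning_signal(
    queue_depth: int,
    active_agents: int,
    provisioned_agents: int,
    arrival_rate: float,      # tasks entering the queue per minute
    completion_rate: float,   # tasks completed per minute
) -> str:
    """Classify fleet load from queue depth and agent utilization."""
    if provisioned_agents <= 0:
        raise ValueError("provisioned_agents must be positive")
    utilization = active_agents / provisioned_agents
    # Empty queue with most agents idle: capacity is being paid for but unused.
    if queue_depth == 0 and utilization < 0.30:
        return "over-provisioned"
    # Tasks arrive faster than they finish, so the queue will keep growing.
    if arrival_rate > completion_rate:
        return "throughput-bottleneck"
    return "ok"
```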

Task dispatch latency — time from task creation to agent assignment — should be tracked at p50, p95, and p99. Elevated p99 dispatch latency often indicates that the routing logic (the fleet commander's model selection and load balancing) is itself becoming a bottleneck.

Parallel agent count over time reveals fleet behavior patterns: predictable workload peaks, unexpected spikes (potential runaway spawning), and periods of underutilization. This is the metric that informs autoscaling decisions.

Agent Health State Machine

Each agent in the fleet should be modeled as a state machine. The minimal viable state set:

IDLE — provisioned, no active task
ACTIVE — executing a task, emitting spans
WAITING — blocked on external resource (tool call in progress, human approval pending)
STUCK — in ACTIVE or WAITING state for longer than the task's expected duration without progress
ERROR — terminal failure, task not completed
COMPLETED — task finished successfully

The STUCK state is the most operationally important and the most frequently missed. An agent stuck in a tool call loop produces no error spans — from the infrastructure layer, it looks healthy. Detection requires a heartbeat model: if an agent has been in ACTIVE state for N × expected_task_duration without emitting a new span, it transitions to STUCK and triggers a human-review alert.
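
A minimal sketch of the state set and the heartbeat check follows. The class and function names (AgentState, AgentHeartbeat, detect_stuck) are hypothetical. Measuring silence from the timestamp of the last emitted span and defaulting N to 3 are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class AgentState(Enum):
    IDLE = "idle"
    ACTIVE = "active"
    WAITING = "waiting"
    STUCK = "stuck"
    ERROR = "error"
    COMPLETED = "completed"


@dataclass
class AgentHeartbeat:
    agent_id: str
    state: AgentState
    last_span_at: float            # epoch seconds of the most recent span
    expected_task_duration: float  # seconds


def detect_stuck(agent: AgentHeartbeat, now: float, n: float = 3.0) -> AgentState:
    """Return STUCK when an ACTIVE or WAITING agent has been silent too long."""
    if agent.state not in (AgentState.ACTIVE, AgentState.WAITING):
        return agent.state
    silent_for = now - agent.last_span_at
    if silent_for > n * agent.expected_task_duration:
        return AgentState.STUCK
    return agent.state


worker = AgentHeartbeat("worker-17", AgentState.ACTIVE, last_span_at=1000.0,
                        expected_task_duration=60.0)
```

Because the check depends only on the absence of new spans, it can run in the collector or the fleet commander without any cooperation from the agent itself.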

Dashboard Design Patterns for Fleet-Level Visibility

The hierarchy of fleet dashboards follows a drill-down pattern: fleet overview → agent group → individual agent → trace detail. Each level surfaces different signal; effective fleet management requires all four levels but with different update frequencies and visual densities.

Fleet Overview Dashboard

The fleet overview is an always-on panel, typically on a wall display or permanently open browser tab. Update frequency: 5–10 seconds. The critical principle: the overview must provide a complete health picture in a single glance, without requiring any clicks.

Required panels for a production fleet overview:

Active agent count with state breakdown: a stacked bar or set of labeled numbers showing Active / Idle / Stuck / Error. The absolute counts matter; the relative proportions over time reveal fleet behavior patterns.
Aggregate throughput: fleet-wide tokens/second and tasks/minute as live-updating numbers with 5-minute trend sparklines. A drop in both simultaneously suggests fleet-wide performance degradation.
Cost burn rate: current USD/hour in large text, monthly projection in smaller text, 24-hour trend sparkline. Color-coded against budget threshold.
Error rate: rolling percentage with threshold highlighting. 0–1% green; 1–5% yellow; over 5% red. This is a fleet aggregate — individual agent error rates appear in the per-agent drill-down.
Queue depth gauge: a gauge or bar showing pending tasks against the configured maximum queue depth. Approaching maximum queue depth is an early warning of throughput problems.
Model distribution: donut chart showing what percentage of active tasks are running on each model. Useful for detecting unintended model routing (e.g., expensive frontier model handling tasks that should route to a cheaper model).
Alert strip: most recent 5 alerts with severity, metric name, and timestamp. Links to the alerting dashboard for full history.

Per-Agent Drill-Down Dashboard

When an anomaly appears in the fleet overview, the operator drills down to a specific agent or agent group. This view provides the context needed to diagnose whether the anomaly is isolated or systemic.

Key panels:

Agent identity panel: agent ID, version (gen_ai.agent.version), currently assigned model, uptime, task history (last N tasks with completion status and cost)
Context window utilization bar: a horizontal bar showing percentage of context consumed, color-coded green/yellow/red at 60%/80%/95% thresholds. This is the most visually immediate signal for impending context overflow.
Token timeline: input tokens vs. output tokens over time. A rising input:output ratio often indicates context accumulation. A rising output token count on a task that should be short-output (e.g., classification) indicates a reasoning loop.
Cost breakdown by task: table of recent tasks with duration, total tokens, and cost. Sortable by cost to immediately identify the most expensive task types.
Tool call breakdown: which tools were called in the current session, success/failure rate per tool, average latency per tool. A tool with a 30% failure rate is often the root cause of increased token consumption as the agent retries.

Trace Waterfall View

The trace waterfall is the debugging view — used after an anomaly has been identified to understand its root cause. For fleet observability, the waterfall must support parallel branch display (fan-out) in addition to sequential chains.

The Phoenix / Langfuse approach: render spans as a Gantt-style waterfall with hierarchy indicated by indentation. Parallel branches that were dispatched simultaneously appear at the same vertical position. LLM spans show token counts inline. Tool spans show execution duration and, when content recording is enabled, the call parameters and result.

For fleet-scale traces (an orchestrator plus 8 parallel workers), the waterfall needs collapsible subtrees. Expanding a worker's subtree shows its full LLM + tool call sequence; the collapsed view shows only the worker span with its aggregate duration and total token cost.
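
The collapsed row for a worker subtree is a recursive roll-up over its child spans. The sketch below illustrates the idea with a hypothetical Span dataclass and collapsed_summary function; it rolls up token counts rather than currency, and takes the duration from the worker span itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Span:
    name: str
    start: float  # seconds
    end: float    # seconds
    tokens: int = 0
    children: List["Span"] = field(default_factory=list)


def collapsed_summary(span: Span) -> Dict:
    """Summarize a subtree as the single row shown when it is collapsed."""
    total_tokens = span.tokens + sum(
        collapsed_summary(child)["total_tokens"] for child in span.children
    )
    return {
        "name": span.name,
        "duration": span.end - span.start,
        "total_tokens": total_tokens,
    }


worker_span = Span("worker-3", start=1.0, end=4.0, children=[
    Span("llm.plan", 1.0, 2.0, tokens=1200),
    Span("tool.search", 2.0, 3.0),
    Span("llm.answer", 3.0, 4.0, tokens=800),
])
row = collapsed_summary(worker_span)
```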

Cost Attribution Panel

Cost attribution is often the metric that gets fleet operations budget approved. A panel showing cost by feature, team, or user tier directly connects observability spend to business value.

Implementation via Langfuse metadata or Helicone-Property-* headers:

// On every outgoing LLM call from the fleet commander
{
  "Helicone-Property-task-type": "code-review",
  "Helicone-Property-user-tier": "enterprise",
  "Helicone-Property-team": "platform-engineering"
}

The resulting cost breakdown panel: a table or treemap showing cost by each dimension, filterable by time range. The key insight: if 20% of your tasks are consuming 80% of your budget, you have a cost optimization target — and this panel makes it visible.

Alerting and Anomaly Detection Architecture

Production fleet alerting requires both threshold-based rules (deterministic, predictable, low false-positive rate) and ML-based anomaly detection (catches novel failure modes that do not trigger static thresholds).

Threshold-Based Alert Rules

These are the rules that every fleet should configure on day one:

Alert	Condition	Severity	Automated Action
Context overflow risk	`context_utilization > 0.80` per agent	Warning	Log, dashboard highlight
Context critical	`context_utilization > 0.95` per agent	Critical	Trigger context compaction or task pause
Cost runaway	`cost_per_hour > 3 × rolling_1h_avg`	Critical	Notify on-call, optionally throttle
Token loop	`output_tokens_rate > 5 × task_baseline`	Warning	Flag for human review
Stuck agent	`time_in_active_state > 3 × expected_duration`	Warning	Alert + automated restart attempt
Error rate spike	`fleet_error_rate > 5%` over 5 minutes	Critical	Notify on-call
Queue saturation	`queue_depth > 0.90 × max_queue_capacity`	Warning	Trigger autoscale
Tool failure rate	`tool_success_rate < 0.70` for any tool	Warning	Flag tool health; consider disabling

ML-Based Anomaly Detection

Static thresholds miss slow-burn anomalies and novel failure modes. The production-grade complement is statistical baseline comparison:

Usage cluster deviation: Helicone and Galileo both implement variants of this pattern. Each agent type or task type develops a characteristic usage profile: typical token counts, typical tool call sequences, typical duration. When an agent deviates significantly from its learned cluster (e.g., a summarization agent suddenly producing 10× its typical output token count), it is flagged as an anomaly even if no static threshold is triggered.

95th percentile rarity flagging: Galileo's approach — scenarios that fall in the 95th percentile of rarity/risk across the fleet's historical behavior are surfaced regardless of whether they triggered a rule. This is particularly valuable for catching prompt injection attacks or model jailbreaks, which produce unusual output patterns that are hard to encode as static rules.

Galileo runtime protection: The most aggressive deployment of anomaly detection — the system blocks or reroutes individual LLM requests before they complete if quality metrics fall below threshold. This shifts from reactive monitoring (detecting failures after they happen) to proactive protection (preventing failures from reaching users). The cost is latency: every request requires a real-time quality evaluation step.
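
To make the usage-cluster idea concrete, the sketch below flags an observation that deviates strongly from an agent type's history. It uses a plain z-score as a stand-in technique; it is not the method used by Helicone or Galileo, and the function name is_usage_anomaly and the threshold of 3.0 are assumptions.

```python
from statistics import mean, stdev
from typing import Sequence


def is_usage_anomaly(history: Sequence[float], observed: float,
                     z_threshold: float = 3.0) -> bool:
    """Flag an observation far from the historical mean, in standard deviations."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly constant history: any change is a deviation.
        return observed != mu
    return abs(observed - mu) / sigma > z_threshold


# Output token counts for recent runs of a summarization agent.
summarizer_history = [480, 510, 495, 505, 500, 490, 520]
```

With this history, an output of 5,000 tokens is flagged while 505 is not, even though neither value crosses a fleet-wide static threshold. A production system would more likely learn multi-dimensional profiles (tokens, tool sequences, duration) than a single scalar.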

The Five Escalation Levels

A production escalation model for fleet alerts:

1. Log only: record the span/event with metadata; no notification; retained for post-hoc debugging
2. Dashboard alert: visual indicator in fleet overview panel; auto-resolves when metric returns to normal
3. Notification: Slack, PagerDuty, or webhook; appropriate for metrics that require human judgment
4. Automated intervention: throttle agent token rate, pause task queue, reroute to alternative model; fires when the automated response is lower-risk than waiting for human response
5. Guardrail block: W&B Weave Guardrails or Galileo runtime protection refuses or reroutes the specific LLM call; appropriate for governance-critical deployments

Governance-aware telemetry (arxiv:2604.05119) extends this model to closed-loop enforcement: observed behavior triggers automated constraint tightening. Policies are held in a central server with hot-reloadable updates, enabling compliance teams to push new constraints fleet-wide in minutes without agent redeployment. This architecture is particularly relevant for regulated industries where audit trails of every constraint change are required.

The Fleet Commander Pattern

The fleet commander is a meta-agent whose primary input is the fleet's own telemetry. It does not execute domain work — it decides who does the work, on which model, at what priority, and with what resource budget.

Core Architecture

[Incoming Task Queue]
         |
         v
[Fleet Commander Agent]
    - Intent classification (LLM call)
    - Load balancing (queue depth per worker pool)
    - Capability matching (task requirements vs. worker capabilities)
    - Cost routing (model selection by task complexity)
    - Budget enforcement (reject or defer if cost budget exhausted)
         |
    [Fan-out to Worker Pool(s)]
    /          |          \
[Worker A]  [Worker B]  [Worker C]
    |           |           |
    \          |          /
         [Coordinator]
    - Result aggregation
    - Partial failure handling
    - Final response assembly

The commander ingests fleet telemetry via a subscription to the OTel metrics pipeline. Its routing decisions are functions of real-time state: current queue depth per worker pool, per-worker context utilization, per-worker error rates in the last N tasks, and the current cost burn rate against the budget.

Trace Context Propagation

The most commonly missed implementation detail in the fleet commander pattern: the commander must inject W3C traceparent context into every worker dispatch. Without this, worker spans appear as independent root spans in the telemetry backend — the orchestrator-worker causal chain is invisible, cost rollup is impossible, and distributed debugging requires manual correlation.

For HTTP-based dispatch:

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: vendor-specific-key=value

For message queue dispatch (e.g., when workers pick tasks from a queue): embed the trace context as a message attribute, not in the payload, to preserve the span boundary semantics.
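
The propagation step can be sketched without any tracing library: parse the incoming traceparent, keep its trace ID, and mint a fresh span ID for each outgoing dispatch. The function names parse_traceparent and traceparent_for_dispatch are hypothetical, and the sketch omits checks a full implementation needs, such as rejecting all-zero IDs. In practice an OpenTelemetry SDK propagator would normally handle this.

```python
import re
import secrets
from typing import Dict

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str) -> Dict[str, str]:
    """Split a traceparent header into its four fields."""
    match = _TRACEPARENT_RE.match(header)
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}


def traceparent_for_dispatch(incoming: str) -> str:
    """Build the header for a worker dispatch: same trace, new span ID."""
    ctx = parse_traceparent(incoming)
    dispatch_span_id = secrets.token_hex(8)  # 8 random bytes = 16 hex chars
    return f"{ctx['version']}-{ctx['trace_id']}-{dispatch_span_id}-{ctx['flags']}"


incoming = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
outgoing = traceparent_for_dispatch(incoming)
```

Because every worker dispatch carries the same trace ID, the backend can reassemble the orchestrator and worker spans into one tree and roll costs up to the originating request.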

Model Routing Logic in the Commander

A production fleet commander implements multi-tier routing:

Complexity tier routing: Simple classification and extraction tasks route to cost-efficient models (Haiku, Gemini Flash). Complex reasoning and code generation route to frontier models (Claude Sonnet/Opus, GPT-4o). The classification is itself an LLM call — typically a fast, cheap model evaluating a short task description against a routing rubric.

Cost-aware routing: When the fleet is running near its hourly cost budget, the commander shifts marginal tasks to cheaper models. This is not a quality degradation — it is adaptive resource management. A summary that would have gone to Sonnet routes to Haiku when the hour's budget is 90% consumed; the quality difference for that specific task type is acceptable.

Context-utilization-aware routing: Workers nearing context window saturation (>80% utilization) should not receive new long-context tasks. The commander tracks per-worker utilization from the telemetry feed and excludes near-saturated workers from routing until they complete their current task and context is cleared.

Anthropic's production pattern: Anthropic's internal multi-agent research system implements the commander pattern with a lead agent that breaks research jobs into pieces and delegates each to a specialist sub-agent with its own model, prompts, and tools. Sub-agents work in parallel on a shared file system. Claude Code's multi-agent code review feature (launched 2026) dispatches a parallel team of agents, each specializing in a different error class (security vulnerabilities, performance issues, style violations), with each agent leaving inline comments. The trace for a single code review request is a multi-level tree: review request → commander decomposition → N parallel specialist agents → coordinator result assembly.
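
The three routing tiers described above can be sketched as two small decisions: which model, and which worker. Everything named here is hypothetical (Worker, pick_model, pick_worker, the CHEAP_MODEL and FRONTIER_MODEL labels, and the downgradable flag marking tasks whose quality is acceptable on the cheaper model). The 90% budget and 80% utilization cutoffs come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional

CHEAP_MODEL = "haiku"      # placeholder label for a cost-efficient model
FRONTIER_MODEL = "sonnet"  # placeholder label for a frontier model


@dataclass
class Worker:
    worker_id: str
    context_utilization: float  # fraction between 0.0 and 1.0
    queue_depth: int


def pick_model(complex_task: bool, downgradable: bool, budget_consumed: float) -> str:
    """Complexity-tier routing with a cost-aware downgrade near the budget cap."""
    if not complex_task:
        return CHEAP_MODEL
    if downgradable and budget_consumed >= 0.90:
        return CHEAP_MODEL
    return FRONTIER_MODEL


def pick_worker(workers: List[Worker], long_context_task: bool) -> Optional[Worker]:
    """Skip near-saturated workers for long-context tasks, then pick the least loaded."""
    eligible = [
        w for w in workers
        if not (long_context_task and w.context_utilization > 0.80)
    ]
    if not eligible:
        return None  # caller can defer the task or scale out
    return min(eligible, key=lambda w: w.queue_depth)


pool = [Worker("a", 0.85, 0), Worker("b", 0.40, 2), Worker("c", 0.10, 5)]
```

Here worker "a" has the shortest queue but is excluded for a long-context task because it is above 80% utilization, so the task goes to "b". In a real commander the inputs would come from the telemetry subscription rather than from static values.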

Tooling Landscape: Choosing Your Fleet Observability Stack

The market in mid-2026 has converged around a few dominant patterns:

Open-Source Self-Hosted Stack

Langfuse + Grafana LGTM (Loki/Grafana/Tempo/Mimir): The most common self-hosted configuration for cost-sensitive teams. Langfuse handles agent-specific telemetry (traces, evaluations, cost attribution, prompt management). Grafana handles infrastructure-level metrics and fleet-wide aggregations. The integration: Langfuse exports to OpenTelemetry Collector; the collector routes to Tempo (traces), Mimir (metrics), and Loki (logs). Grafana dashboards query all three backends.

Langfuse's key differentiator for fleet observability is its evaluation-first architecture: evaluations run directly on production traces, closing the loop between observation and quality assessment. For fleet commanders, Langfuse's session tracking enables grouping all traces from a single orchestrated workflow into a single session object, enabling fleet-level cost rollup.

Arize Phoenix: Best-in-class for teams prioritizing span kind granularity and framework coverage. Phoenix's ten span kinds and native support for all major frameworks and providers means that heterogeneous fleets (mixing OpenAI, Anthropic, and Bedrock agents) produce unified traces without custom adapters. Phoenix also supports the Open Agent Specification via its Oracle integration.

Managed Cloud Stack

Datadog LLM Observability: The choice for teams already using Datadog for infrastructure monitoring. The value proposition is a single pane for AI agent telemetry alongside APM, logs, and infrastructure metrics. Datadog natively maps to OTel GenAI Semantic Conventions (2025) and provides automatic cost calculation from token counts plus provider pricing. The GitHub Copilot integration covers completions, chat, agent mode, plan mode, CLI, and PR activity via the Copilot Usage API — the only major platform shipping GitHub Copilot fleet metrics natively.

Helicone: The fastest integration path for teams without existing observability infrastructure. A proxy-first architecture means observability is a side-effect of routing — point base_url at Helicone and you immediately get cost trends, usage, latency, and geographic breakdown with zero SDK changes. Custom cost attribution via Helicone-Property-* headers. Helicone's three session workflow patterns — Workflow (/task/research/web_search), Conversation (/session/question_1), Pipeline (/process/extract/transform/load) — cover the majority of fleet orchestration patterns.

Capability Comparison

Platform	Open Source	Self-Hosted	OTel Native	Span Kinds	Eval Pipeline	Gateway Proxy	Cost Attribution
Langfuse	Yes	Yes	Yes (ingest)	Standard	Yes	No	Yes (metadata)
Arize Phoenix	Yes	Yes	Yes (built on)	10 types	Yes	No	Yes (spans)
W&B Weave	Yes	Partial	Yes	Standard	Yes (safety)	No	Yes (spans)
Helicone	Yes	Yes	Partial	Standard	No	Yes	Yes (headers)
Datadog LLM Obs	No	No	Yes (native)	Standard	Limited	No	Yes (auto)
Galileo	No	No	Partial	Standard	Yes (ML)	No	Yes
MLflow	Yes	Yes	Yes	Standard	Yes	No	Limited

Claude Code Agent Fleet Observability

Claude Code ships built-in OTel output. The open-source reference implementation claude-code-hooks-multi-agent-observability (github.com/disler/claude-code-hooks-multi-agent-observability) demonstrates real-time monitoring of multiple concurrent Claude Code agents via hook event tracking. It monitors all 12 hook event types (PreToolUse, PostToolUse, UserPromptSubmit, SubagentStop, etc.) with session tracking, event filtering, and live updates, giving fleet-level visibility into concurrent Claude Code agent sessions. The Claude Cookbook entry "The Observability Agent" (platform.claude.com/cookbook) documents building an observability meta-agent on the Claude Agent SDK that monitors other running agents, which is a concrete implementation of the fleet commander pattern.

VS Code's GitHub Copilot OTel support (tracked in VS Code issue #293225) enables spans to be persisted to a local SQLite database (github.copilot.chat.otel.dbSpanExporter.enabled: true). Datadog's native Copilot integration surfaces completions, chat, agent mode, plan mode, CLI, and PR activity in a single fleet dashboard.

Production Implementation Patterns

The Telemetry Pipeline Architecture

A production fleet telemetry pipeline has four stages:

Stage 1 — Instrumentation: Each agent emits OTel spans to a local OTel Collector sidecar. The sidecar provides batching, retry logic, and initial filtering. Fleet commanders inject traceparent headers into all downstream dispatches.
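The traceparent injection in Stage 1 can be sketched with the standard library alone. The header layout follows the W3C Trace Context format (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags); the function names and the dispatch header dictionary are hypothetical. In a real fleet the OpenTelemetry SDK's propagation API would normally do this for you, so treat this as an illustration of what crosses the dispatch boundary rather than a replacement for the SDK.

```python
import re
import secrets
from typing import Dict

# W3C Trace Context, version 00: 00-<trace-id>-<parent-id>-<flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace at the fleet commander."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(parent: str) -> str:
    """Derive the header for a downstream dispatch.

    The trace ID and flags are preserved so the worker's spans join the
    commander's trace; only the span ID changes.
    """
    match = TRACEPARENT_RE.match(parent)
    if match is None:
        raise ValueError(f"malformed traceparent: {parent!r}")
    trace_id, _parent_span, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def inject(headers: Dict[str, str], traceparent: str) -> Dict[str, str]:
    """Return a copy of the dispatch headers with traceparent attached."""
    out = dict(headers)
    out["traceparent"] = traceparent
    return out


root = new_traceparent()
dispatch_headers = inject({"content-type": "application/json"}, child_traceparent(root))
```

The property worth checking in any implementation is that every worker dispatch carries the same trace ID as the commander's span; without it, worker traces appear as unrelated roots in the backend.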

Stage 2 — Enrichment: The collector sidecar enriches spans with deployment metadata: deployment.environment, service.version, k8s.pod.name. It also runs the PII redaction processor — a critical step for production deployments where LLM span content may contain user data. The processor pattern: inspect gen_ai.input.messages and gen_ai.output.messages attributes, apply regex or ML-based PII detection, replace detected PII with placeholder tokens before forwarding. Never store raw LLM content without PII review.
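A minimal sketch of the regex branch of that processor pattern follows. It assumes span attributes arrive as a plain dictionary and that the two message attributes hold serialized strings; the patterns, placeholder tokens, and function names are illustrative and far from exhaustive. A production deployment would more likely configure this inside the collector itself or combine it with ML-based detection, as the text notes.

```python
import re
from typing import Any, Dict

# Illustrative patterns only; real PII coverage needs many more cases.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

# Attributes that may carry raw prompt or completion content.
CONTENT_ATTRIBUTES = ("gen_ai.input.messages", "gen_ai.output.messages")


def redact_text(text: str) -> str:
    """Replace every detected PII match with a placeholder token."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def redact_span_attributes(attributes: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the span attributes with content fields redacted.

    Non-content attributes (model name, token counts, and so on) pass
    through untouched so metrics remain accurate.
    """
    redacted = dict(attributes)
    for key in CONTENT_ATTRIBUTES:
        value = redacted.get(key)
        if isinstance(value, str):
            redacted[key] = redact_text(value)
    return redacted


span_attrs = {
    "gen_ai.request.model": "example-model",
    "gen_ai.input.messages": "Contact jane@example.com or +1 415 555 0100",
}
clean_attrs = redact_span_attributes(span_attrs)
```

Returning a copy rather than mutating in place keeps the original span available to any processor that runs earlier in the pipeline, at the cost of one extra dictionary per span.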

Stage 3 — Export: The enriched spans export via OTLP/gRPC to the chosen backend. For self-hosted Grafana LGTM: Tempo for traces, Mimir for metrics, Loki for logs. For managed cloud: Datadog, New Relic, or Honeycomb via their respective OTel endpoints.

Stage 4 — Dashboard and Alerting: Grafana (for self-hosted) or the managed platform's native UI renders the fleet overview, per-agent drill-down, and trace waterfall views. Alertmanager (for self-hosted) or the managed platform's alerting engine evaluates threshold rules and routes to Slack, PagerDuty, or custom webhooks.

Sampling Strategy for Fleet Traces

The standard guidance — "AI agent traces require 100% sampling" — is correct for debugging and quality evaluation, but creates storage and cost problems at fleet scale. The production compromise is tail-based sampling at the fleet collector:

Sample 100% of traces containing ERROR spans (all failures need investigation)
Sample 100% of traces with duration > p99 of the task type baseline (slowest tail always retained)
Sample 100% of traces where cost > 3× the task type average (cost anomalies always retained)
Sample 10–20% of normal traces (representative set for baseline statistics)

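This strategy retains every operationally significant trace while reducing storage by roughly 80–90% compared to full sampling, provided that error, slow, and high-cost traces make up only a small share of total volume.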
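The four retention rules above can be expressed as a single decision function, evaluated once a trace is complete. This is a sketch under stated assumptions: the TraceSummary and Baseline types, their field names, and the 15% default rate are hypothetical, and per-task-type baselines are assumed to be computed elsewhere. The OpenTelemetry Collector's tail sampling processor covers the same ground through configuration; the code only illustrates the decision logic.

```python
import random
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class TraceSummary:
    task_type: str
    has_error: bool
    duration_s: float
    cost_usd: float


@dataclass
class Baseline:
    """Rolling statistics for one task type (assumed precomputed)."""
    p99_duration_s: float
    avg_cost_usd: float


def keep_trace(
    trace: TraceSummary,
    baseline: Baseline,
    normal_rate: float = 0.15,
    rng: Callable[[], float] = random.random,
) -> Tuple[bool, str]:
    """Decide whether to retain a completed trace, and why."""
    if trace.has_error:
        return True, "error"
    if trace.duration_s > baseline.p99_duration_s:
        return True, "slow"
    if trace.cost_usd > 3 * baseline.avg_cost_usd:
        return True, "cost"
    # Normal traces: keep a representative fraction for baseline statistics.
    if rng() < normal_rate:
        return True, "sampled"
    return False, "dropped"


baseline = Baseline(p99_duration_s=30.0, avg_cost_usd=0.10)
decision = keep_trace(TraceSummary("research", False, 12.0, 0.45), baseline)
```

Returning the reason alongside the decision makes it possible to tag retained traces, so dashboards can separate the anomaly set from the representative sample when computing baselines.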

Context Window Monitoring Implementation

Implementing the 80% context utilization alert requires knowing the model's context limit at the time the alert evaluates. The recommended approach: maintain a model capability registry — a simple key-value store mapping gen_ai.request.model → context_window_limit. The fleet commander populates this registry from provider documentation; the alerting engine queries it when evaluating gen_ai.usage.input_tokens / context_window_limit.

The model capability registry also enables the fleet commander's context-aware routing: before assigning a long-context task to a worker, check the worker's current gen_ai.usage.input_tokens against the model's limit. Route to a model with a larger context window if the task requirements would push the current model over threshold.
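A minimal sketch of the registry and the two checks it supports is shown below. The model names and context limits are placeholders rather than real provider figures, and the in-memory dictionary stands in for whatever key-value store the fleet actually uses; the 0.80 threshold is the alert level from the text.

```python
from typing import Dict, Optional

# Hypothetical registry: gen_ai.request.model -> context_window_limit (tokens).
CONTEXT_LIMITS: Dict[str, int] = {
    "model-small": 128_000,
    "model-large": 1_000_000,
}


def context_utilization(model: str, input_tokens: int) -> float:
    """gen_ai.usage.input_tokens divided by the model's context limit."""
    return input_tokens / CONTEXT_LIMITS[model]


def should_alert(model: str, input_tokens: int, threshold: float = 0.80) -> bool:
    """True when a worker has crossed the utilization alert threshold."""
    return context_utilization(model, input_tokens) >= threshold


def route_model(
    current_model: str,
    current_input_tokens: int,
    task_tokens: int,
    threshold: float = 0.80,
) -> Optional[str]:
    """Pick a model whose window fits the projected context.

    Keeps the current model when the task fits under the threshold;
    otherwise returns the smallest registered model that fits, or None
    if no registered model can take the task.
    """
    projected = current_input_tokens + task_tokens
    if projected / CONTEXT_LIMITS[current_model] <= threshold:
        return current_model
    for model, limit in sorted(CONTEXT_LIMITS.items(), key=lambda kv: kv[1]):
        if projected / limit <= threshold:
            return model
    return None


choice = route_model("model-small", current_input_tokens=90_000, task_tokens=20_000)
```

Choosing the smallest model that fits, rather than always the largest, is one reasonable policy when larger context windows cost more per token; a fleet with different pricing could sort the candidates differently.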

Conclusion

Fleet observability is what separates AI agent systems that operate reliably in production from those that degrade silently. The technical foundation is now standardized — OpenTelemetry GenAI semantic conventions (consolidated in January 2025) provide the attribute schema; the OTLP protocol provides the transport; open-source platforms like Langfuse and Arize Phoenix provide the storage and visualization layer. The patterns described in this article — the fleet health state machine, the orchestrator trace propagation requirement, the tail-based sampling strategy, the fleet commander's telemetry feedback loop — are not theoretical. They are patterns distilled from production deployments at the companies and open-source projects described throughout.

The most important architectural decision is also the simplest: instrument from the start, not after problems appear. OTel semantic conventions are low-cost to add during initial development, and prohibitively expensive to retrofit into a fleet where context propagation was never built into the orchestrator-worker handoff. Every multi-agent system deployed today without trace context propagation across dispatch boundaries is accumulating a visibility debt that will make the next production incident significantly harder to diagnose.

The fleet commander pattern — consuming your own fleet's telemetry to make routing, throttling, and cost-management decisions — represents the maturation of AI agent systems from reactive pipelines to self-managing infrastructure. It closes the loop between observation and action in a way that no human operator can maintain at fleet scale. The telemetry pipeline, properly instrumented, becomes not just a monitoring system but the sensory system for an autonomous fleet.

Sources:

https://opentelemetry.io/docs/specs/semconv/gen-ai/ — OTel official gen_ai semantic conventions
https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/ — gen_ai client span conventions
https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-metrics/ — gen_ai metrics conventions
https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/ — gen_ai agent and framework span conventions
https://opentelemetry.io/blog/2025/ai-agent-observability/ — OTel AI agent observability blog post
https://github.com/open-telemetry/semantic-conventions/issues/2664 — gen_ai semantic conventions for agentic systems (tracking issue)
https://github.com/langfuse/langfuse — Langfuse GitHub repository
https://langfuse.com/docs/observability/overview — Langfuse observability documentation
https://github.com/Helicone/helicone — Helicone GitHub repository
https://docs.helicone.ai/features/sessions — Helicone sessions: workflow, conversation, pipeline patterns
https://github.com/arize-ai/phoenix — Arize Phoenix GitHub repository
https://github.com/wandb/weave — W&B Weave GitHub repository
https://docs.datadoghq.com/llm_observability/monitoring/cost/ — Datadog LLM observability cost monitoring
https://www.datadoghq.com/blog/llm-otel-semantic-convention/ — Datadog native OTel GenAI Semantic Conventions support
https://github.com/disler/claude-code-hooks-multi-agent-observability — claude-code-hooks-multi-agent-observability reference implementation
https://platform.claude.com/cookbook/claude-agent-sdk-02-the-observability-agent — Claude Cookbook: the observability agent pattern
https://code.visualstudio.com/docs/agents/guides/monitoring-agents — VS Code agent observability with OpenTelemetry
https://arxiv.org/pdf/2604.05119 — Governance-Aware Agent Telemetry for Closed-Loop Enforcement
https://arxiv.org/pdf/2506.11019 — Mind the Metrics: telemetry-aware in-IDE AI development
https://arxiv.org/html/2604.14228v1 — Dive into Claude Code: design space of AI agent systems
https://www.anthropic.com/engineering/multi-agent-research-system — Anthropic engineering: multi-agent research system
https://github.com/microsoft/vscode/issues/293225 — VS Code issue: agent observability based on OpenTelemetry
https://www.getmaxim.ai/articles/top-5-ai-agent-observability-platforms-in-2026/ — Top 5 AI agent observability platforms for 2026
https://galileo.ai/blog/best-ai-agent-observability-platforms — Galileo: best AI agent observability platforms review
https://mlflow.org/top-5-agent-observability-tools/ — MLflow: top 5 LLM and agent observability tools
https://openobserve.ai/blog/llm-cost-monitoring/ — OpenObserve: LLM cost monitoring with prebuilt dashboard