AI Agent Telemetry Dashboards — From Raw Signals to Actionable Insights
Executive Summary
Traditional APM dashboards answer "is the system running?" AI agent dashboards must answer "is the system making sound decisions?" This distinction — structural telemetry versus semantic telemetry — is the defining challenge of AI agent observability in 2026.
An agent can report HTTP 200, sub-second latency, and zero exceptions while returning wrong answers, burning tokens in reasoning loops, or selecting incorrect tools. These semantic failures are invisible to infrastructure-layer metrics. Production agent monitoring requires evaluation layers on top of structural telemetry to detect the failures that actually matter.
This article examines the emerging stack: OpenTelemetry GenAI semantic conventions as the data standard, open-source platforms (Langfuse, Grafana LGTM, Arize Phoenix) as the visualization layer, and privacy-preserving collector architectures that redact PII before storage. We cover the metrics that matter, the anti-patterns that destroy signal, and concrete pipeline architectures deployed in production today.
The Semantic Gap: Why Traditional APM Falls Short
Traditional APM monitors request-response cycles in microservice architectures. The trace unit is an HTTP request; the error signal is an exception or 4xx/5xx status; latency means response time. AI agent monitoring breaks every one of these assumptions.
| Dimension | Traditional APM | AI Agent Dashboard |
|---|---|---|
| Trace unit | HTTP request/response | Multi-step reasoning chain (N model calls + tool calls) |
| Error signal | Exception, 4xx/5xx | Semantic failure (HTTP 200 but wrong answer) |
| Latency meaning | Response time | TTFT, inter-token latency, full chain duration |
| Cost signal | Compute/infra cost | Token cost per step, per tool, per user tier |
| Quality signal | Not applicable | Correctness, tool selection accuracy, task completion rate |
| Sampling | Often 1-10% | 100% required — sampling discards execution trees |
The last row deserves emphasis: AI agent traces are hierarchical and multi-step. A single user request may produce 8+ inference calls across a four-agent pipeline. Sampling at 10% does not just lose 90% of the data: it discards entire execution trees, so the one failing session you need to debug is most likely missing. Full-fidelity tracing is the cost of operating production AI agents correctly.
How Production Platforms Expose Telemetry
Claude Code
Claude Code ships with built-in OpenTelemetry output. When organizations enable telemetry, platforms like Datadog's AI Agents Console ingest real-time session data. The dashboard breaks event volume by type (api_request, api_error, tool_decision, tool_result, user_prompt) and provides views for:
- Summary: total spend and token usage across users
- Performance: latency percentiles, error rate trends, top failed commands, success rates by repository
- Cost & Value: per-user spend, model-specific cost trends (Sonnet vs Opus vs Haiku)
- Usage & Adoption: active user counts, common operations, repository-level activity
Datadog also links Claude Code telemetry to associated Git commits and pull requests — connecting AI activity directly to engineering outputs.
Other Agent Platforms
Cursor, Windsurf, and OpenAI Codex are primarily IDE-embedded and do not expose OTel telemetry natively. Monitoring happens through vendor dashboards or via proxy gateways (Helicone, Portkey) inserted at the API layer. None currently provide a bring-your-own-observability pipeline comparable to Claude Code's OTel export.
The Converging Architecture
Production agent platforms are converging on a common pattern:
Agent Runtime → OTel SDK → OTel Collector → Backend(s) → Dashboard
The OTel Collector receives on gRPC (4317) and HTTP (4318), applies processors (batching, PII redaction, attribute enrichment), and fans out to specialized backends — Prometheus/Mimir for metrics, Tempo for traces, Loki for logs.
Key Metrics That Matter
Token Economics
The fundamental unit economics metric is cost per successful session — total token spend divided by sessions where the user's goal was achieved. If solving a support ticket costs $2.00 in AI compute but $1.50 with a human, the initiative has negative ROI regardless of how impressive the technology looks.
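As a minimal sketch of that ratio, the snippet below divides total spend across all sessions (including failed ones) by the number of successful sessions. The `Session` fields and the `goal_achieved` label are hypothetical; in practice the success label would come from an evaluator or user feedback.

```python
from dataclasses import dataclass

@dataclass
class Session:
    token_cost_usd: float  # total model spend attributed to this session
    goal_achieved: bool    # hypothetical label from an evaluator or user feedback

def cost_per_successful_session(sessions):
    """Total spend over ALL sessions divided by the count of successful ones."""
    total_cost = sum(s.token_cost_usd for s in sessions)
    successes = sum(1 for s in sessions if s.goal_achieved)
    if successes == 0:
        return float("inf")  # money spent, nothing achieved
    return total_cost / successes

sessions = [
    Session(0.40, True),
    Session(0.90, False),  # failed sessions still count toward spend
    Session(0.30, True),
    Session(0.40, True),
]
print(round(cost_per_successful_session(sessions), 2))  # 0.67
```

Failed sessions stay in the numerator on purpose: their cost is real even though they produced no value.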
Critical token signals:
| Metric | What It Reveals |
|---|---|
| gen_ai.usage.input_tokens | Prompt size; context accumulation signal |
| gen_ai.usage.output_tokens | Generation cost |
| gen_ai.usage.cache_read.input_tokens | Cache effectiveness; high rates dramatically reduce cost |
| gen_ai.usage.cache_creation.input_tokens | Cache write overhead |
| gen_ai.usage.reasoning.output_tokens | Chain-of-thought cost (extended thinking, o-series) |
Context window economics are a hidden cost driver. In multi-step agent workflows, cumulative input tokens grow roughly quadratically with the number of steps, because each step re-reads the full prior context. Monitoring context utilization per step, not just total tokens per request, is essential. A single agent stuck in a reasoning loop can exhaust an entire daily token budget in minutes.
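The growth in input tokens can be illustrated with a small calculation. This is a sketch under simplifying assumptions: every step re-sends the whole context, each step appends a fixed number of tokens, and prompt caching is ignored. The prompt size and per-step increment are made-up numbers.

```python
def cumulative_input_tokens(steps, base_prompt, tokens_added_per_step):
    """Sum input tokens across a chain where each step re-reads all prior context."""
    total = 0
    context = base_prompt
    for _ in range(steps):
        total += context                  # this step's input is the whole context so far
        context += tokens_added_per_step  # tool output / model reply appended for next step
    return total

for n in (5, 10, 20, 40):
    print(n, cumulative_input_tokens(n, base_prompt=2_000, tokens_added_per_step=1_500))
# 5 25000
# 10 87500
# 20 325000
# 40 1250000
```

Doubling the chain from 20 to 40 steps roughly quadruples the input tokens billed, which is why per-step context size is the number to watch.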
Latency Profiles
Four distinct latency signals matter, each measuring something different:
- Time to First Token (TTFT): Perceived responsiveness. For interactive agents, a response that begins streaming within 100ms often feels faster than one that finishes sooner overall but stalls before or during the stream.
- Inter-chunk latency: Streaming throughput quality — stutters degrade user experience.
- Tool execution latency: p50/p95 and failure rate per individual tool. A tool failing 12% of the time can disappear inside an aggregate error rate yet be devastating to agent reliability.
- Full chain duration: Total wall time including all model calls, tool executions, and retries.
All latency metrics should use histograms (p50, p95, p99), not averages. Averages hide long tails — a p99 at 30 seconds is invisible in an average of 2 seconds.
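A small illustration of why the average misleads, using synthetic latencies in which 3% of requests hit a 30-second tail. The `percentile` helper is a plain nearest-rank implementation written for this example, not a library function.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Synthetic data: 97 fast requests and 3 that fall into a 30-second tail.
latencies_s = [1.0] * 97 + [30.0] * 3

mean = sum(latencies_s) / len(latencies_s)
print("mean:", round(mean, 2))                # mean: 1.87
print("p50:", percentile(latencies_s, 50))    # p50: 1.0
print("p95:", percentile(latencies_s, 95))    # p95: 1.0
print("p99:", percentile(latencies_s, 99))    # p99: 30.0
```

The mean looks healthy and even p95 misses the tail here; only p99 exposes the 30-second requests.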
Tool Use Patterns
- Per-tool invocation frequency: Expensive tools called unnecessarily signal poor planning.
- Per-tool error rate: A 5% failure rate on a single tool compounds significantly across multi-step chains.
- Retry count per tool call: Hallucinated arguments or transient failures surface here first.
- Tool selection accuracy: Did the agent choose the right tool? Requires evaluation data, not just instrumentation.
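The compounding effect of a per-tool error rate can be worked out directly. The sketch below assumes failures are independent and that any failed call (after its retries) fails the whole chain; real agents may recover in other ways, so treat the numbers as an illustration rather than a prediction.

```python
def chain_success_probability(per_call_failure_rate, calls, retries=0):
    """Probability that every call in a chain succeeds, given independent failures."""
    # A call only fails if the first attempt and all retries fail.
    per_call_success = 1 - per_call_failure_rate ** (retries + 1)
    return per_call_success ** calls

for calls in (1, 5, 10, 20):
    p = chain_success_probability(0.05, calls)
    print(f"{calls:>2} calls: {p:.3f}")
```

With a 5% failure rate and no retries, a 10-call chain succeeds about 60% of the time and a 20-call chain about 36% of the time, which is why per-tool error rates and retry counts belong on the same panel.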
Context Utilization
Context window exhaustion is a production failure mode unique to agents. When a 20-50 step workflow accumulates context from each prior step, windows fill silently. Agents lose early-step information, make incorrect decisions, and fail tasks — all returning HTTP 200. Key signals:
- Context tokens at each step of a multi-step chain
- Rate of context growth per loop iteration
- Proximity to context limit as percentage (alert at 80%)
- Context bloat events — sudden jumps of 40%+ in window size
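The last two signals can be sketched as a simple check over per-step context sizes. The function name, the alert labels, and the sample numbers are illustrative; the 80% and 40% thresholds come from the list above.

```python
def context_alerts(step_tokens, context_limit,
                   utilization_threshold=0.80, jump_threshold=0.40):
    """Return (step, alert) pairs for near-limit utilization and sudden context growth."""
    alerts = []
    previous = None
    for step, tokens in enumerate(step_tokens, start=1):
        if tokens / context_limit >= utilization_threshold:
            alerts.append((step, "near_limit"))
        # Relative growth versus the previous step; skipped on the first step.
        if previous and (tokens - previous) / previous >= jump_threshold:
            alerts.append((step, "context_bloat"))
        previous = tokens
    return alerts

steps = [20_000, 24_000, 36_000, 40_000, 165_000]
print(context_alerts(steps, context_limit=200_000))
# [(3, 'context_bloat'), (5, 'near_limit'), (5, 'context_bloat')]
```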
Semantic Telemetry: Beyond Counts and Latencies
Traditional telemetry is structural: counts, durations, error rates. Semantic telemetry is behavioral: did the agent do the right thing?
The Three-Tier Metric Framework
Galileo's production framework categorizes agent metrics into three tiers:
Tier 1 — Decision Quality: Tool selection accuracy, action completion rate, context adherence, output correctness, tool error rate. These directly measure whether the agent made the right choices.
Tier 2 — Behavior Quality: Agent flow (did it take the expected path?), agent efficiency (steps taken vs optimal), action advancement (did each step move toward the goal?), instruction adherence.
Tier 3 — Safety and UX: Conversation quality, prompt injection detection rate, PII detection events, toxicity scores, user intent change tracking.
Evaluation Techniques
LLM-as-Judge: A secondary model scores each production interaction against quality criteria — relevance, correctness, groundedness. Runs asynchronously without blocking production.
Specialized SLM scorers: Lightweight models purpose-built for scoring (correctness, toxicity, PII detection) run in real time on every interaction. More cost-efficient than full LLM judges for high-volume production.
Retrieval relevance: For RAG-augmented agents, degrading retrieval quality is a leading indicator of quality collapse weeks before users notice.
User state transitions: Track users through engagement cohorts (New → Exploring → Engaged → Power User → At-Risk → Churned). Invisible degradation — users becoming less ambitious in their queries — shows up in state transitions before it appears in usage counts.
Open-Source Tooling Landscape
Langfuse
Architecture: Web + Worker containers, PostgreSQL (transactional), ClickHouse (OLAP trace storage), Redis (queueing), S3 (raw event durability). The v3 migration from PostgreSQL to ClickHouse for trace analytics was the critical design decision — aggregation queries that took seconds in PostgreSQL run in milliseconds in ClickHouse.
Self-hosted under MIT license. Native OTel OTLP ingestion plus Python/JS SDKs. Best for teams wanting full data ownership with prompt versioning and evaluation workflows in one platform.
Arize Phoenix
Open-source (Apache 2.0), 8,000+ GitHub stars. SDK-based with embedded clustering and drift detection. Stores context graphs as queryable business assets. Built on the OpenInference instrumentation standard. Best for teams needing ownership of context graphs and advanced quality analytics.
Grafana LGTM Stack
The self-hosted reference pipeline:
Agent App (any framework)
  └─ OpenLLMetry / OpenLIT (auto-instrumentation)
       └─ OTLP → OTel Collector (Grafana Alloy)
            ├─ Traces → Grafana Tempo
            ├─ Metrics → Grafana Mimir
            └─ Logs → Grafana Loki
                 └─ Grafana (correlated dashboards via trace_id)
Grafana Alloy is the recommended OTel Collector distribution — 100% OTLP compatible, supporting both OTel and Prometheus formats natively. VictoriaMetrics offers an alternative backend with VictoriaTraces replacing Tempo, claiming better compression at scale.
Helicone
Proxy-based: change API base URL, collect data immediately — 15-minute setup. No SDK required. Free tier of 100K requests/month. The zero-instrumentation advantage comes with a trade-off: single point of failure and API key centralization risk.
AgentOps
Agent-framework-native (400+ framework integrations). Distinguishing feature: session replay with time-travel debugging — replay any production session step-by-step. Recursive loop detection catches agents stuck in reasoning cycles before they exhaust budgets.
OpenLLMetry (Traceloop)
The OTel-native instrumentation layer — not a storage/dashboard backend. Extends OpenTelemetry with GenAI-specific instrumentations for LLMs (OpenAI, Anthropic, Cohere), vector databases (Pinecone), and frameworks (LangChain, Haystack). Available in Python, JavaScript, Go, Ruby. Pure OTel means it plugs directly into any OTLP-compatible backend.
OpenTelemetry GenAI Semantic Conventions
The OTel GenAI semantic conventions are emerging as the standard vocabulary for AI agent telemetry. While still in Development status, they are already adopted by Datadog, New Relic, and other major observability platforms.
Standard Metrics
| Metric | Unit | Description |
|---|---|---|
| gen_ai.client.token.usage | {token} | Input and output token histogram |
| gen_ai.client.operation.duration | s | GenAI operation duration |
| gen_ai.client.operation.time_to_first_chunk | s | Time to first streaming chunk |
| gen_ai.client.operation.time_per_output_chunk | s | Inter-chunk latency |
| gen_ai.server.request.duration | s | Server-side request duration |
| gen_ai.server.time_per_output_token | s | Duration per token (server) |
| gen_ai.server.time_to_first_token | s | Server TTFT |
Standard Span Operations
- gen_ai.request: Individual model call
- gen_ai.invoke_agent: Full agent lifecycle (task → output)
- gen_ai.execute_tool: Tool invocation with I/O
MCP Semantic Conventions
For Model Context Protocol servers, the conventions define 26 standard methods, four histogram metrics, and context propagation via params._meta. The critical implementation detail: context does not automatically propagate from LLM inference to MCP tool calls — inject(tool_headers) must be called manually at each boundary.
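To make the manual propagation step concrete, here is a standard-library sketch that writes a W3C Trace Context `traceparent` value into `params._meta` of a JSON-RPC tool call. It is a stand-in for what an OpenTelemetry propagator would do; the helper names `make_traceparent` and `with_trace_context` are hypothetical, and a real integration would take the IDs from the active span rather than generating them.

```python
import secrets

def make_traceparent(trace_id_hex, span_id_hex, sampled=True):
    """Format a W3C Trace Context value: version-traceid-spanid-flags."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id_hex}-{span_id_hex}-{flags}"

def with_trace_context(mcp_request, trace_id_hex, span_id_hex):
    """Return a copy of a JSON-RPC request carrying trace context under params._meta."""
    params = dict(mcp_request.get("params", {}))
    meta = dict(params.get("_meta", {}))
    meta["traceparent"] = make_traceparent(trace_id_hex, span_id_hex)
    params["_meta"] = meta
    return {**mcp_request, "params": params}

# Placeholder IDs; in real code these come from the current span context.
trace_id = secrets.token_hex(16)  # 32 hex characters
span_id = secrets.token_hex(8)    # 16 hex characters

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "search", "arguments": {"q": "otel"}},
}
traced = with_trace_context(request, trace_id, span_id)
print(traced["params"]["_meta"]["traceparent"])
```

The original request is left untouched, so the same payload can be retried with a fresh span ID at each boundary.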
Provider-Specific Extensions
Vendor-specific conventions exist for Anthropic, OpenAI, Azure AI Inference, and AWS Bedrock, covering provider-unique attributes (Anthropic's cache control, Bedrock's guardrail invocations) within the shared framework.
Dashboard Anti-Patterns
The Resolution Rate Lie
An 85% resolution rate sounds good. If "resolved" means "conversation ended" rather than "user problem solved," the metric is meaningless. Three silent failure modes:
- Agent marks tickets resolved but answers were wrong — users give up rather than escalate.
- Knowledge base goes stale — retrieval relevance degrades for months while usage stays flat.
- Users lose trust — they remain active but shift from complex tasks to simple ones. Value collapses while DAU stays green.
The Curiosity Spike
At launch, usage surges as users test limits. Without metrics distinguishing curious explorers from daily dependents, teams misread novelty as product-market fit. Vanity metrics dressed as success.
Average Latency as Primary Metric
Averages hide the long tail. Agents with retry logic and tool failures create heavy-tailed latency distributions where averages are actively misleading. p50 and p95 are the minimum for production dashboards.
High-Cardinality Label Explosion
Adding run_id, user_id, or session_id as metric labels creates millions of unique time series, causing query timeouts and storage explosion. Segment by model, release, tool_name, user_tier — use trace IDs for individual-request debugging, not metric labels.
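The explosion is easy to estimate: the worst-case number of time series for one metric is the product of the distinct values of each label. The cardinalities below are invented for illustration, and real systems will see fewer series than the upper bound because not every combination occurs.

```python
from math import prod

def max_series_count(label_cardinalities):
    """Upper bound on time series for one metric: product of per-label distinct values."""
    return prod(label_cardinalities.values())

bounded = {"model": 4, "release": 10, "tool_name": 25, "user_tier": 3}
with_session_id = {**bounded, "session_id": 50_000}

print(max_series_count(bounded))          # 3000
print(max_series_count(with_session_id))  # 150000000
```

One unbounded label multiplies everything else, which is why per-request identifiers belong on traces rather than on metric labels.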
Alert Fatigue from Structural-Only Alerts
Porting traditional alerting patterns to AI agents reproduces classic alert fatigue: on the order of ~4,484 alerts/day, 67% of them ignored. Agent monitoring requires behavioral anomaly detection (context window jumps 40%, tool invocations tripling) rather than static threshold alerts on individual metrics.
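One simple form of behavioral detection compares each new value to a rolling baseline instead of a fixed threshold. The sketch below flags points that are at least three times the median of the previous five samples; the window size, the ratio, and the sample series are assumptions chosen for illustration, and production systems often use more robust detectors.

```python
from collections import deque
from statistics import median

def baseline_anomalies(series, window=5, ratio=3.0):
    """Return indexes where a value is >= ratio times the median of the prior window."""
    history = deque(maxlen=window)
    flagged = []
    for i, value in enumerate(series):
        if len(history) == window:  # wait until a full baseline exists
            baseline = median(history)
            if baseline > 0 and value >= ratio * baseline:
                flagged.append(i)
        history.append(value)
    return flagged

tool_calls_per_minute = [10, 12, 11, 9, 10, 11, 35, 12, 10]
print(baseline_anomalies(tool_calls_per_minute))  # [6]
```

Using the median keeps a single spike from inflating the baseline and masking the next one.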
Privacy-Preserving Observability
The Core Tension
Observability platforms that ingest unfiltered prompts and tool I/O become inadvertent PII repositories — a GDPR and CCPA compliance liability. Full prompt logging is operationally valuable for debugging; it is legally hazardous in production.
Architecture Patterns
OTel Collector-level redaction: Custom PII processors run in the Collector pipeline between receivers and exporters, combining regex patterns with Named Entity Recognition models. Redaction happens before storage, not after — data that never reaches the database cannot be leaked.
AI Gateway redaction: Proxy-based tools (Portkey, Kong) intercept all prompts at the API layer. Input guardrails handle prompt injection detection and PII scrubbing; output guardrails filter for PII leakage and toxicity. Centralizes handling but introduces SPOF risk.
Opt-in content capture: The OTel GenAI conventions treat gen_ai.input.messages and gen_ai.output.messages as opt-in attributes, not captured by default. Teams can enable full capture in dev/staging and disable (or redact) in production.
Tokenization/masking: Sensitive entities are replaced with deterministic placeholders ([PERSON_NAME]) before logging. The reverse mapping is stored separately with restricted access. Preserves debuggability while preventing exposure.
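The masking pattern can be sketched with a regex and a separately held reverse mapping. This example only handles email addresses with a deliberately simple pattern; person names and other entities would need NER as described above. The `mask_emails` function, the `[EMAIL_n]` placeholder format, and the in-memory `vault` are hypothetical stand-ins for a restricted-access store.

```python
import re

# Deliberately simple email pattern for illustration; not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def mask_emails(text, vault):
    """Replace each email with a stable placeholder; vault maps placeholder -> original."""
    reverse = {original: placeholder for placeholder, original in vault.items()}

    def replace(match):
        original = match.group(0)
        if original not in reverse:
            placeholder = f"[EMAIL_{len(reverse) + 1}]"
            reverse[original] = placeholder
            vault[placeholder] = original  # stored apart from the logs in a real system
        return reverse[original]

    return EMAIL_RE.sub(replace, text)

vault = {}
masked = mask_emails(
    "Contact ana@example.com or bo@example.org; ana@example.com is primary.", vault
)
print(masked)  # Contact [EMAIL_1] or [EMAIL_2]; [EMAIL_1] is primary.
```

Because the same address always maps to the same placeholder, an engineer can still follow one entity through a trace without seeing the underlying value.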
Data Ownership
The critical question from Arize: "Does it own the context graph, or do you?" Traces that flow to vendor backends as the single source of truth create long-term dependency and potential data exposure. Self-hosted backends (Langfuse, Phoenix) or OTel-native pipelines to your own infrastructure keep context graphs as owned business assets under your security controls.
Implications for Zylos Dashboard
The zylos-dashboard project sits at the intersection of these trends. The data inventory work (OTel, Hooks, StatusLine) maps directly to the three-signal architecture emerging in production:
- OTel signals (metrics, traces, logs) provide the structural telemetry backbone
- Hook events provide agent-specific behavioral data with full tool I/O — richer than OTel alone
- StatusLine fills the gap with token economics data (cost, cache hit rate, rate limits) that neither OTel nor Hooks capture
The key insight from this research: the dashboard must serve multiple audiences (engineer debug view, cost/usage executive view, security audit view) and should prioritize semantic metrics (task completion, tool selection accuracy) alongside structural ones (latency, error rate, token count). The anti-pattern to avoid is building a beautiful dashboard of vanity metrics that shows green while agents silently degrade.
References
- OpenTelemetry GenAI Semantic Conventions — Metrics, Client Spans, MCP
- AI Agent Observability — Evolving Standards (OpenTelemetry Blog)
- Monitor Claude Code with Datadog's AI Agents Console (Datadog)
- The Enterprise Guide to AI Agent Observability (Galileo)
- Why Your AI Agent Dashboard Is Lying to You (Vindler)
- AI Agent Observability with OpenTelemetry and Grafana LGTM
- Distributed Tracing for Agentic Workflows (Red Hat)
- AgentOps: Enabling Observability of LLM Agents (arXiv)
- Langfuse v3 ClickHouse Architecture
- Arize: Best AI Observability Tools for Autonomous Agents 2026

