AI Agent Observability: LLM Telemetry in Production
Executive Summary
AI agent observability has matured into a distinct engineering discipline. Traditional APM tools (Datadog, New Relic, Grafana) remain blind to the failure modes that matter most in LLM systems — silent semantic failures, context window exhaustion, and runaway token costs. The OpenTelemetry GenAI SIG has produced a comprehensive gen_ai.* attribute schema covering spans, metrics, events, and agent-specific conventions, though all remain in Development (experimental) status as of May 2026. A rich ecosystem has emerged — Langfuse (acquired by ClickHouse), LangSmith, Helicone, SigNoz, Arize Phoenix, OpenLLMetry — each with distinct trade-offs between local sovereignty and managed convenience. Cost runaway ($47K+ incidents documented) has made cost observability a first-class concern, and PII leakage into telemetry pipelines is a regulated problem with practical mitigations now available.
The Observability Gap
Why Traditional APM Falls Short
Traditional APM is built on a core assumption: software is deterministic. Requests either succeed or fail, performance reduces to numeric latency, and errors surface as exceptions. LLMs violate all three assumptions.
Silent semantic failure is the canonical example: an agent returns HTTP 200 with a hallucinated answer. Error rate stays 0%. Accuracy drops to 40%. Traditional APM cannot detect this failure mode.
Four specific gaps emerge:
- Behavioral blindness — No metric for "does this answer mean what I intended." Traditional tools track requests and errors but miss AI model behavioral shifts: predictions skewing, confidence collapsing, semantic drift.
- Statistical ignorance — Cannot detect probabilistic patterns like bimodal prediction score distributions signaling model confusion, or gradual accuracy decay that never crosses an error threshold.
- Causality void — Assumes direct cause-and-effect. LLM failures have indirect causality: data quality shifts, context window overflow, prompt injection from user input, upstream retrieval degradation — none producing conventional error signals.
- Context absence — Inspects requests individually. Cannot track multi-turn conversation state, detect context window utilization approaching limits, or attribute failures to history accumulated over 20 prior turns.
What Does Work
Infrastructure metrics remain valid: server CPU/GPU utilization, network latency to LLM provider endpoints, container memory, queue depth. The gap is entirely at the semantic/application layer. Datadog has partially closed this by natively supporting OTel GenAI Semantic Conventions (v1.37+), correlating gen_ai.* spans alongside existing APM traces — but this covers collection, not interpretation.
Emerging Standards: OpenTelemetry GenAI
Current Status
All GenAI semantic conventions remain in Development status — not yet Stable. The OTel GenAI SIG was formally established April 2024. The breadth of coverage has outpaced stability progression, with no public commitment on graduation timeline. Instrumentations on v1.36.0 or earlier should set OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental to opt in.
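For instrumentations that honor this flag, the opt-in must be in place before any instrumentation initializes; a minimal sketch in Python (the variable can equally be exported in the shell or deployment manifest):

```python
import os

# Opt in to the latest experimental GenAI semantic conventions. Set this
# before any instrumentation library is initialized.
os.environ["OTEL_SEMCONV_STABILITY_OPT_IN"] = "gen_ai_latest_experimental"
```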
Five Signal Categories
The standard defines five signal types: Events (GenAI inputs and outputs), Exceptions (GenAI-specific error handling), Metrics (quantitative measurements), Model Spans (inference operations), and Agent Spans (agent-framework operations). Provider coverage includes Anthropic, OpenAI, AWS Bedrock, Azure AI, GCP Vertex AI/Gemini, Cohere, Mistral, Groq, DeepSeek, xAI, and IBM watsonx.ai.
Key Span Attributes
Required at span creation: gen_ai.operation.name (values: chat, execute_tool, create_agent, invoke_agent, invoke_workflow, retrieval), gen_ai.provider.name, and gen_ai.request.model.
Token usage and related attributes (recommended):
| Attribute | Description |
|---|---|
| gen_ai.usage.input_tokens | Prompt tokens including cached |
| gen_ai.usage.output_tokens | Completion tokens |
| gen_ai.usage.cache_creation.input_tokens | Tokens written to provider cache |
| gen_ai.usage.cache_read.input_tokens | Tokens served from cache |
| gen_ai.usage.reasoning.output_tokens | Chain-of-thought tokens (separate billing) |
| gen_ai.response.time_to_first_chunk | TTFT in seconds |
| gen_ai.conversation.id | Session/thread ID for multi-turn correlation |
Content attributes (gen_ai.input.messages, gen_ai.output.messages, gen_ai.tool.call.arguments, gen_ai.tool.call.result) are opt-in only — a deliberate privacy-protective stance in the standard.
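A minimal manual-instrumentation sketch with the OpenTelemetry Python API, assuming an OpenAI chat call; the tracer name, model, conversation ID, and token counts are placeholders, and the gen_ai.* attribute names are still experimental:

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent-observability-demo")  # illustrative name

# Span name follows the "{operation} {model}" pattern used by the conventions.
with tracer.start_as_current_span("chat gpt-4o") as span:
    span.set_attribute("gen_ai.operation.name", "chat")   # required
    span.set_attribute("gen_ai.provider.name", "openai")  # required
    span.set_attribute("gen_ai.request.model", "gpt-4o")  # required

    # ... provider SDK call happens here; usage values below are placeholders.
    span.set_attribute("gen_ai.usage.input_tokens", 1200)
    span.set_attribute("gen_ai.usage.output_tokens", 340)
    span.set_attribute("gen_ai.conversation.id", "thread-42")

    # gen_ai.input.messages / gen_ai.output.messages are opt-in only:
    # record them solely when content capture is explicitly enabled.
```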
Standardized Metrics
| Metric | Unit | Description |
|---|---|---|
| gen_ai.client.token.usage | {token} | Input and output token histogram |
| gen_ai.client.operation.duration | s | End-to-end latency |
| gen_ai.client.operation.time_to_first_chunk | s | TTFT for streaming |
| gen_ai.server.time_to_first_token | s | Server-side TTFT (queue + prefill) |
| gen_ai.server.time_per_output_token | s | Decode phase performance |
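A sketch of recording two of these client metrics with the OpenTelemetry Python metrics API; the meter name, the gen_ai.token.type split, and the recorded values are assumptions for illustration:

```python
from opentelemetry import metrics

meter = metrics.get_meter("agent-observability-demo")  # illustrative name

token_usage = meter.create_histogram(
    "gen_ai.client.token.usage", unit="{token}",
    description="Input and output tokens per operation",
)
operation_duration = meter.create_histogram(
    "gen_ai.client.operation.duration", unit="s",
    description="End-to-end GenAI operation latency",
)

# Recorded once per model call; values are placeholders.
common = {
    "gen_ai.operation.name": "chat",
    "gen_ai.provider.name": "openai",
    "gen_ai.request.model": "gpt-4o",
}
token_usage.record(1200, {**common, "gen_ai.token.type": "input"})
token_usage.record(340, {**common, "gen_ai.token.type": "output"})
operation_duration.record(2.8, common)
```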
Agent-Specific Conventions
Agent identity is tracked via gen_ai.agent.id, gen_ai.agent.name, gen_ai.agent.version. Span types include create_agent, invoke_agent (CLIENT kind for remote agents like OpenAI Assistants; INTERNAL for local frameworks like LangChain), and invoke_workflow for multi-agent orchestration. The spec acknowledges an open challenge: frameworks must "reliably distinguish workflows from single agents" — not yet solved.
OpenLLMetry (Traceloop)
The open-source project closest to implementing OTel GenAI conventions in practice. Available in Python, Node.js/TypeScript, Go, Ruby. Auto-instruments LLM provider calls, vector DBs (Pinecone, Chroma, Weaviate), and frameworks (LangChain, LlamaIndex) without code changes. Emits standard OTLP to any OTel backend — zero vendor lock-in at the collection layer.
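Getting started is typically a single initialization call; a sketch with the Python SDK (the app name is illustrative, and the OTLP destination comes from standard OTel environment variables):

```python
# pip install traceloop-sdk
from traceloop.sdk import Traceloop

# Auto-instruments supported LLM SDKs, vector DBs, and frameworks, and
# exports standard OTLP to whatever backend is configured (for example
# via OTEL_EXPORTER_OTLP_ENDPOINT).
Traceloop.init(app_name="my-agent")
```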
Key Production Metrics
What Teams Actually Monitor
Latency: TTFT is the most user-perceptible metric for streaming. P95/P99 end-to-end matters more than mean due to heavy tails in LLM response distributions. Time-per-output-token degrades under high concurrency.
Token Economics: Cache hit rate is a first-class KPI — cache hits cost 10–90% less depending on provider. Context utilization percentage correlates with hallucination risk as it approaches limits. Reasoning tokens must be tracked separately due to different billing.
Tool Call Health: Success/failure rate by tool type, call count per session (elevated counts signal reasoning loops), and reasoning depth (alerting above 8 loops, which typically indicates a stuck agent).
Quality Metrics: Canary evaluation accuracy (known-answer queries every 5 minutes, alert below 90%), semantic drift score (embedding distance from baseline, alert below 0.7 similarity), and hallucination rate (LLM-as-judge is the de facto approach, though no standard methodology exists).
Cost Metrics: Cost per interaction/session/user/feature for unit economics, budget burn rate with standard alert cascade (75% / 90% / 95% / 100%), and anomalous burn rate (alert when consumption exceeds 3× rolling average within a 5-minute window — catches runaway loops faster than threshold alerts).
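A sketch of the canary-evaluation idea from the quality metrics above, assuming the caller supplies the model-query and alerting functions; the canary set, matching logic, and threshold are illustrative:

```python
CANARIES = [
    {"prompt": "What is the capital of France?", "expected": "paris"},
    {"prompt": "How many days are in a week?", "expected": "seven"},
]

def run_canaries(ask_model, alert, threshold=0.90):
    """Run known-answer queries and alert when accuracy drops below threshold.

    ask_model(prompt) -> str and alert(message) are integration points supplied
    by the caller; schedule this every 5 minutes from a worker or cron job.
    """
    correct = sum(
        1 for c in CANARIES if c["expected"] in ask_model(c["prompt"]).lower()
    )
    accuracy = correct / len(CANARIES)
    if accuracy < threshold:
        alert(f"Canary accuracy {accuracy:.0%} fell below {threshold:.0%}")
    return accuracy
```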
Observability Architecture Landscape
Two Philosophical Camps
LLM Development Platforms (Langfuse, LangSmith, Comet Opik) combine observability with evaluation suites, prompt management, and experimentation — the full development-to-production lifecycle.
Monitoring and Instrumentation Tools (SigNoz, Helicone, Arize Phoenix, OpenLLMetry) focus on telemetry collection, anomaly detection, and cost tracking without trying to be a development environment.
Tool Comparison
Langfuse — MIT license; acquired by ClickHouse in 2025. Architecture: PostgreSQL + ClickHouse + Redis + S3. 19,000+ GitHub stars, 1,000+ self-hosted ClickHouse deployments. Session replay, hallucination/toxicity evaluators, reasoning token tracking. Self-hosted (free) or Langfuse Cloud (usage-based).
LangSmith — Proprietary SaaS only (no self-host). Seat-based + trace-based pricing. Best for LangChain/LangGraph ecosystem with auto-instrumented visual execution graphs.
Helicone — Proxy-first: change API base URL, all traffic routes through gateway. ClickHouse + Kafka internally. Open source (YC W23). Zero code changes required, 100+ models, built-in caching (20–30% cost reduction), smart routing, rate limiting. Free tier: 100k requests/month.
SigNoz — OpenTelemetry-native, full-stack (LLM + infra + logs in one platform). Open source community edition + Cloud + enterprise BYOC. Key differentiator: correlates LLM traces with infrastructure metrics in a single tool.
Arize Phoenix — Elastic License 2.0 (source-available, free self-host). Drift detection for RAG, hallucination pattern visualization, pre-built eval templates.
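Helicone's proxy-first approach, for example, reduces integration to a base-URL change. A sketch against the OpenAI Python SDK, using the gateway endpoint and auth header from Helicone's documented OpenAI integration (worth verifying against current docs; the key placeholders are illustrative):

```python
from openai import OpenAI

# Route traffic through the Helicone gateway instead of api.openai.com; the
# gateway then handles logging, caching, and rate limiting with no other
# code changes.
client = OpenAI(
    api_key="<OPENAI_API_KEY>",
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": "Bearer <HELICONE_API_KEY>"},
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```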
Local-First vs. SaaS Trade-offs
| Dimension | Local-First | SaaS |
|---|---|---|
| Data sovereignty | Full | Data exits to vendor |
| Setup cost | Higher (Docker/K8s) | Minimal |
| Privacy compliance | Easier | Requires vendor DPA |
| Scale | Your responsibility | Managed |
Read-Only vs. Active Control
A major architectural divide: read-only tools (most SDK-based) instrument and surface insights without intervening. Active control tools (Portkey, Helicone gateway, AgentOps) enforce budget caps, rate limits, and circuit breakers in real time — but sit in the request path, introducing a single point of failure risk.
Multi-Agent Observability
The Trace Propagation Problem
W3C Trace Context (traceparent header: version-traceid-spanid-sampled) maintains a single trace ID across service boundaries. Each downstream service extracts context and creates a child span. In practice, most agent frameworks have automatic context propagation gaps — Llama Stack, for example, does not automatically propagate span context to MCP servers during Responses API calls, requiring manual opentelemetry.propagate.inject().
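Where automatic propagation is missing, the context can be injected by hand at the boundary; a minimal sketch with the OpenTelemetry Python API (the span name and outgoing-request line are illustrative):

```python
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer("orchestrator")

with tracer.start_as_current_span("invoke_agent research-agent"):
    headers = {}
    # Writes the W3C traceparent (and tracestate) for the current span into the
    # carrier, so the downstream agent or MCP server can continue the trace.
    inject(headers)
    # e.g. requests.post(agent_url, json=payload, headers=headers)
```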
Five Key Challenges
- Agent boundary crossing: Each handoff is a potential trace break.
- Emergent behavior opacity: Individual agents work; combination fails. No single agent's trace reveals it.
- Cascading failures: Orchestrator exceeds context limit → dependent agents stall → dependent agents loop → $47K bill before alerts fire.
- Shared context contamination: Multiple agents read/write shared memory; interference invisible in individual traces.
- Dynamic execution graphs: Agent workflows fork based on model outputs — traditional tree trace visualizations fail for DAG/cyclic structures.
Emerging Solutions
Langfuse Agent Graphs (GA late 2025) provides node graph visualization of multi-agent execution with inline tool call visibility. Arize AX offers context graph ownership for enterprise scale. Maxim AI supports up to 1MB trace elements with checkpoint-based debugging — replay from specific execution points without full re-run.
Recommended patterns: hard loop limits per tool (default 50), alert on reasoning depth >8, circuit breakers at orchestrator level, maintain an agent dependency graph for deadlock/starvation detection, and separate "agent health" spans from model inference spans.
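A minimal sketch of the loop-limit and reasoning-depth guards; the class shape, hook points, and alert callback are assumptions, since real frameworks expose different interception points:

```python
class LoopGuard:
    """Hard per-tool call limits plus a reasoning-depth alert for one session."""

    def __init__(self, alert, max_calls_per_tool=50, alert_depth=8):
        self.alert = alert
        self.max_calls_per_tool = max_calls_per_tool
        self.alert_depth = alert_depth
        self.tool_calls = {}
        self.reasoning_depth = 0

    def before_tool_call(self, tool_name):
        # Hard limit: abort the run rather than let a loop keep burning tokens.
        self.tool_calls[tool_name] = self.tool_calls.get(tool_name, 0) + 1
        if self.tool_calls[tool_name] > self.max_calls_per_tool:
            raise RuntimeError(f"Loop limit exceeded for tool {tool_name!r}")

    def on_reasoning_step(self):
        # Soft limit: alert on suspiciously deep reasoning, but keep running.
        self.reasoning_depth += 1
        if self.reasoning_depth > self.alert_depth:
            self.alert(f"Reasoning depth {self.reasoning_depth}: possible stuck agent")
```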
Cost Observability
The Stakes
Documented incidents: a single-query runaway cost $2,847 in tokens at a startup (2025). A multi-agent cascade at Operator Collective hit $47,000. As per-token costs fall, agents take on more autonomous work — total cost risk increases even as unit costs decrease.
The Five-Layer Architecture
- Metadata tagging (foundational): Attach user_id, feature_name, team, and session_id to every LLM call. All attribution depends on this.
- Two complementary controls: Rate limits (tokens/requests per minute — latency protection) + budget limits (dollars/period — financial protection). These are distinct and both required.
- Budget alerting: Alert cascade at 75% / 90% / 95% / 100% of cap. Each fires once per period.
- Anomaly detection: Alert when consumption exceeds 3× rolling average within 5 minutes — catches loops faster than thresholds.
- Hard session caps: e.g., $0.50/session. Graceful abort + degraded response when exceeded. (Layers three through five are sketched after this list.)
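A minimal sketch of those last three layers in one place, assuming a per-minute spend feed and an alert callback; class, method, and parameter names are illustrative:

```python
from collections import deque

ALERT_STEPS = (0.75, 0.90, 0.95, 1.00)

class CostController:
    def __init__(self, alert, period_budget_usd, session_cap_usd=0.50, window_minutes=5):
        self.alert = alert
        self.period_budget = period_budget_usd
        self.session_cap = session_cap_usd
        self.period_spend = 0.0
        self.fired = set()                           # alert steps fired this period
        self.recent = deque(maxlen=window_minutes)   # per-minute spend window
        self.session_spend = {}

    def record(self, session_id, cost_usd, minute_spend):
        """Returns False when the session cap is hit and the caller should abort gracefully."""
        self.period_spend += cost_usd
        self.session_spend[session_id] = self.session_spend.get(session_id, 0.0) + cost_usd

        # Budget alert cascade: each threshold fires once per period.
        for step in ALERT_STEPS:
            if self.period_spend >= step * self.period_budget and step not in self.fired:
                self.fired.add(step)
                self.alert(f"Budget {step:.0%} reached: ${self.period_spend:.2f}")

        # Anomaly detection: current minute vs 3x the rolling average.
        if self.recent and minute_spend > 3 * (sum(self.recent) / len(self.recent)):
            self.alert(f"Spend anomaly: ${minute_spend:.2f}/min exceeds 3x rolling average")
        self.recent.append(minute_spend)

        # Hard session cap.
        return self.session_spend[session_id] < self.session_cap
```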
Optimization Feedback Loops
Cache hit rate as primary KPI (caching reduces costs 30–90% for repeated system prompts). Context utilization tracking reveals summarization opportunities before hitting limits. Model routing directs simple tasks to cheaper models. Prompt A/B testing with stricter output constraints reduces completion token counts.
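The routing idea can be as simple as a rule over request features; a toy sketch where the model names and thresholds are placeholders (production routers typically use a small classifier or offline-tuned heuristics):

```python
CHEAP_MODEL = "gpt-4o-mini"   # placeholder model names
FRONTIER_MODEL = "gpt-4o"

def choose_model(prompt: str, requires_tools: bool = False, high_risk: bool = False) -> str:
    # Route long, tool-using, or high-risk requests to the frontier model;
    # everything else goes to the cheaper model.
    if requires_tools or high_risk or len(prompt) > 2000:
        return FRONTIER_MODEL
    return CHEAP_MODEL
```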
Privacy and Security in Agent Telemetry
The Core Tension
Full observability requires capturing prompts and completions. Prompts and completions contain PII. Observability platforms become inadvertent PII repositories — a GDPR/CCPA risk. The OTel GenAI standard's decision to make content attributes opt-in only is a meaningful privacy-protective default.
Privacy-Preserving Techniques
Research on LLM-Redactor (arXiv:2604.12064, April 2026), evaluated on 1,300 synthetic samples, compares the following techniques:
| Technique | Latency | Use Case |
|---|---|---|
| Route to local 3B model | Varies | Trivial requests |
| Redaction + typed placeholders | <50ms | Primary technique |
| Semantic rephrasing (local model) | ~1–2s | When redaction alone insufficient |
| Trusted Execution Environment hosting | Minimal | Zero-tolerance environments |
| Fully homomorphic encryption | Impractical | Not recommended |
The combination of routing + redaction + semantic rephrasing achieves a 0.6% leak rate for PII-heavy prompts with ~1–2s overhead. A critical limitation: implicit identity (writing about someone without naming them) still shows a >95% semantic leak rate even with all mitigations — the structural relationships are the content, and no technical solution exists yet.
The Safe Observability Architecture
The recommended pattern is an OTel Collector with a custom PII-Redaction Processor (Go) using regex + NER (Microsoft Presidio + spaCy). Redaction at the collector layer ensures uniform coverage — per-service redaction risks inconsistent rules. For maximum sovereignty, deploy the collector as a localhost-only sidecar: raw prompt data never leaves the host, only sanitized telemetry forwards to the backend.
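A sketch of the redaction step itself using Presidio's Python API; note that the processor described above runs in Go inside the collector, so this only illustrates the regex-plus-NER transformation and typed placeholders (the sample prompt is invented):

```python
# pip install presidio-analyzer presidio-anonymizer (plus a spaCy model)
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()      # spaCy-backed NER plus pattern recognizers
anonymizer = AnonymizerEngine()  # default operator replaces spans with <ENTITY_TYPE>

prompt = "Email alice@example.com about the invoice for Jane Doe."
findings = analyzer.analyze(text=prompt, language="en")
redacted = anonymizer.anonymize(text=prompt, analyzer_results=findings).text
print(redacted)  # e.g. "Email <EMAIL_ADDRESS> about the invoice for <PERSON>."
```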
Implications for Agent Infrastructure Teams
The field is converging on several practical patterns:
Start with OTel GenAI conventions even though they're experimental — the gen_ai.* schema is comprehensive and the closest thing to a standard. The experimental status means breaking changes are possible, but the alternative (custom schemas) is worse.
Separate semantic monitoring from infrastructure monitoring. Use traditional APM for infra, purpose-built tools for LLM semantics. SigNoz is the closest to a single-pane solution, but most teams run parallel stacks.
Cost observability is not optional. Implement budget alerting and hard session caps before deploying agents autonomously. The 3× rolling average anomaly detection catches loops faster than static thresholds.
Privacy redaction belongs in the telemetry pipeline, not in application code. The OTel Collector processor pattern is architecturally sound and practically viable. Opt-in content capture with localhost-only collection gives the best balance of observability and sovereignty.
Multi-agent tracing is still immature. Expect manual context propagation work at agent boundaries. Design for it from the start — retrofitting trace propagation into an existing multi-agent system is painful.
Data Points
| Metric | Value | Source |
|---|---|---|
| OTel GenAI SIG established | April 2024 | opentelemetry.io |
| All GenAI conventions status | Development (experimental) | opentelemetry.io |
| Langfuse GitHub stars | 19,000+ | SigNoz comparison |
| Langfuse acquired by ClickHouse | 2025 | ClickHouse blog |
| Runaway agent cost (single query) | $2,847 | DEV.to |
| Runaway agent cost (multi-agent cascade) | $47,000 | Operator Collective |
| Caching cost reduction range | 30–90% | Portkey |
| PII redaction overhead (regex+NER) | <50ms | arXiv:2604.12064 |
| Implicit identity leak rate (all mitigations) | >95% | arXiv:2604.12064 |
| Reasoning loop alert threshold | >8 loops | Industry practice |
| "AI agent observability" search growth YoY | +193% | Early 2026 |
| Splunk AI Agent Monitoring GA | Q1 2026 | Splunk blog |

