Seeing Inside the Black Box: Agent Observability in Production

Executive Summary

For the last two years, most teams shipping AI agents into production have leaned on the same tooling they use for web services: structured logs, error rates, and latency dashboards. It works — until it doesn't. A 200 OK with a 4-second latency tells you the request completed. It tells you nothing about whether the agent looped twice before answering, called the wrong tool with hallucinated arguments, or silently discarded half the user's context mid-run.

Agent observability is the discipline of making that invisible layer inspectable. The core idea is not new — distributed tracing has been a standard practice in microservice architectures for years. What is new is the semantic layer on top: a growing set of conventions that describe AI-specific operations (LLM calls, tool invocations, agent handoffs, memory reads) in a structured, queryable form rather than as raw log strings.

As of mid-2026, the toolchain has matured enough that teams have real choices: fully managed platforms with generous free tiers, open-source self-hosted stacks, and a converging OpenTelemetry standard that reduces vendor lock-in. The challenge has shifted from "how do I instrument this at all" to "how do I instrument it efficiently, without drowning in data."

Why Traditional APM Falls Short

A conventional APM trace for an agent request looks like this: one HTTP span, one or two database queries, 200 OK, done. Latency: 6 seconds. Everything green.

What actually happened during those 6 seconds: the agent received a question, called a knowledge retrieval tool with an over-broad query that returned 40 irrelevant chunks, fed them into a summarization step that hallucinated a policy that doesn't exist, tried to validate the result against a schema, failed, retried with a different prompt, succeeded on the second attempt, and returned an answer that was technically valid but factually wrong in a way no schema check could catch.

The APM trace shows none of this. Traditional observability was designed around deterministic code paths — the same inputs produce the same outputs. Agents violate this assumption structurally. They branch conditionally, loop dynamically, and make probabilistic decisions at every step. What you need is not a performance trace but a behavioral trace: a record of what the agent decided, in what order, with what inputs and outputs at each step.

This is why purpose-built agent observability tools exist, and why they are converging on a fundamentally different data model than traditional APM.

The OpenTelemetry Foundation

OpenTelemetry (OTel) has become the instrumentation standard for agent observability, for the same reason it won in microservices: it is vendor-neutral, widely adopted, and adds negligible overhead (typically under 1 ms per instrumented call, against LLM API latencies of 100 ms to 30 s).

The critical addition for AI systems is the GenAI semantic conventions — a set of standardized span attributes and span types developed by OTel's GenAI Special Interest Group. These conventions define the vocabulary for describing AI operations in a way that is consistent across providers and frameworks.

Span Types for Agents

The current conventions, which are still evolving under the GenAI SIG, define three primary agent-level span types:

Span Name	Kind	Use Case
`create_agent {name}`	CLIENT	Initializing a remote agent service
`invoke_agent {name}`	CLIENT / INTERNAL	Calling an agent (local or remote)
`invoke_workflow {name}`	INTERNAL	Coordinated multi-agent processes

Each LLM call within an agent run becomes a child span with attributes including:

gen_ai.operation.name — the operation type (chat, invoke_agent, etc.)
gen_ai.provider.name — provider identifier (openai, anthropic, aws.bedrock)
gen_ai.request.model — the exact model version
gen_ai.usage.input_tokens / gen_ai.usage.output_tokens — token counts for cost attribution
gen_ai.response.finish_reasons — why the model stopped (important for detecting truncation)
error.type — error classification when operations fail
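As one illustration of how these attributes are used downstream, the sketch below flags LLM call spans that stopped because of a token limit. It works on a plain dictionary of span attributes rather than a real OTel span object. The helper name is_truncated, the model id, and the set of truncation values are assumptions for illustration; providers report different finish-reason strings, so adjust the set to what your instrumentation actually emits.

```python
# Hypothetical helper: flags LLM spans whose finish reasons indicate truncation.
# The values below are an assumption; providers use different strings
# (for example "length" for OpenAI and "max_tokens" for Anthropic).
TRUNCATION_REASONS = {"length", "max_tokens"}


def is_truncated(span_attributes: dict) -> bool:
    """Return True if any finish reason on the span signals a token-limit stop."""
    reasons = span_attributes.get("gen_ai.response.finish_reasons", [])
    return any(reason in TRUNCATION_REASONS for reason in reasons)


# Example attributes for one LLM call span (all values are made up).
llm_span = {
    "gen_ai.operation.name": "chat",
    "gen_ai.provider.name": "anthropic",
    "gen_ai.request.model": "example-model",  # placeholder model id
    "gen_ai.usage.input_tokens": 1200,
    "gen_ai.usage.output_tokens": 4096,
    "gen_ai.response.finish_reasons": ["max_tokens"],
}

print(is_truncated(llm_span))
```

A check like this is cheap to run as an alert rule or as a filter when browsing traces, since a truncated response often explains a downstream parsing or validation failure.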

Tool calls get their own execute_tool child spans, capturing the tool name, input arguments, and response. This is where a lot of the most actionable debugging signal lives — if an agent is calling the right tool with wrong arguments, or retrying a tool call silently, it shows up here before it shows up anywhere else.

A Minimal Instrumentation Example

Here is what a manually instrumented Python agent looks like with the OTel GenAI conventions, using the opentelemetry-sdk and opentelemetry-exporter-otlp-proto-grpc packages:

import json

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure export to your backend (Langfuse, Phoenix, Braintrust, etc.)
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("my-agent", "1.0.0")

# web_search() and call_llm() are placeholders for your own tool and model client.

def run_agent(user_input: str):
    with tracer.start_as_current_span(
        "invoke_agent my-research-agent",
        attributes={
            "gen_ai.operation.name": "invoke_agent",
            "gen_ai.agent.name": "my-research-agent",
            "gen_ai.provider.name": "anthropic",
        },
    ) as agent_span:
        # Tool call span: record the arguments actually sent to the tool
        with tracer.start_as_current_span(
            "execute_tool web_search",
            attributes={
                "gen_ai.operation.name": "execute_tool",
                "gen_ai.tool.name": "web_search",
                "gen_ai.tool.call.arguments": json.dumps({"query": user_input}),
            },
        ) as tool_span:
            result = web_search(user_input)
            # Truncate the result so large payloads do not bloat the span
            tool_span.set_attribute("gen_ai.tool.call.result", str(result)[:500])

        # LLM call span (typically auto-instrumented)
        response = call_llm(user_input, result)
        agent_span.set_attribute("gen_ai.usage.output_tokens", response.usage.output_tokens)
        return response

In practice, most teams use auto-instrumentation packages — opentelemetry-instrumentation-anthropic, opentelemetry-instrumentation-openai, or framework-level adapters for LangChain/LangGraph, CrewAI, and Pydantic AI — rather than writing spans by hand. The manual example above is useful mainly for custom agent code that falls outside those frameworks.

The Four Failure Modes That Traces Expose

Once traces are flowing, four categories of agent failures become visible that are essentially undetectable from logs or metrics alone:

1. Tool-call hallucination. The agent invokes a real tool with fabricated arguments. The tool call span shows the argument payload the model generated — before execution. You can alert on argument patterns that look malformed or that contain fields the tool schema doesn't define.

2. Silent retry loops. An agent retries a failing tool call multiple times before giving up or succeeding. From the outside this looks like normal latency variation. In the trace it's visible as a parent span containing 4–5 execute_tool child spans for the same tool, with escalating latency.

3. Plan drift. The agent's stated reasoning at step 1 diverges from its actual actions at step 3. Reasoning spans — capturing the model's chain-of-thought where available — surface this. This is particularly common in long multi-step workflows where each LLM call has a truncated view of earlier context.

4. Stale memory reads. In agents with persistent memory, a retrieval step pulls outdated or wrong-entity records. Memory operation spans, recording the query and the returned records, make this directly debuggable without reconstructing the session from scratch.
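The first two failure modes lend themselves to simple automated checks over exported spans. The sketch below assumes spans have been flattened into dictionaries with an "attributes" key; the function names, the retry threshold of 3, and the example tool schemas are all hypothetical. For brevity it counts tool executions across the whole trace rather than per parent span.

```python
import json
from collections import Counter


def find_retry_loops(spans, threshold=3):
    """Return {tool_name: count} for tools executed at least `threshold` times."""
    counts = Counter(
        s["attributes"]["gen_ai.tool.name"]
        for s in spans
        if s["attributes"].get("gen_ai.operation.name") == "execute_tool"
    )
    return {tool: n for tool, n in counts.items() if n >= threshold}


def unknown_arguments(span, tool_schemas):
    """Return argument names the model supplied that the tool schema does not define."""
    attrs = span["attributes"]
    allowed = tool_schemas.get(attrs["gen_ai.tool.name"], set())
    supplied = json.loads(attrs.get("gen_ai.tool.call.arguments", "{}"))
    return sorted(set(supplied) - allowed)


def tool_span(tool, arguments):
    """Build a minimal execute_tool span dictionary for the example."""
    return {
        "name": f"execute_tool {tool}",
        "attributes": {
            "gen_ai.operation.name": "execute_tool",
            "gen_ai.tool.name": tool,
            "gen_ai.tool.call.arguments": json.dumps(arguments),
        },
    }


# Four calls to the same tool, plus one call with a field the schema lacks.
trace_spans = [tool_span("web_search", {"query": "refund policy"}) for _ in range(4)]
trace_spans.append(tool_span("get_order", {"order_id": "A1", "customer_tier": "gold"}))
schemas = {"web_search": {"query"}, "get_order": {"order_id"}}

print(find_retry_loops(trace_spans))
print(unknown_arguments(trace_spans[-1], schemas))
```

Plan drift and stale memory reads are harder to detect mechanically and usually need a human or an LLM-as-judge reading the relevant spans.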

Tail Sampling: Not Drowning in Data

A busy production agent deployment generating traces for every run quickly becomes unmanageable. 100% trace retention is expensive and, more importantly, most of those traces are uninteresting. The mature pattern is tail-based sampling: make the sampling decision after the trace completes, based on what happened, not at the start.

A practical production policy:

Retain 100%:
  - Traces containing any span with status=ERROR
  - Traces where gen_ai.usage.input_tokens + gen_ai.usage.output_tokens > 50,000
  - Traces where total estimated cost > $0.50
  - Traces where total latency > 30s

Retain 10%:
  - All other traces (uniform random sample)

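This policy ensures you always have full coverage for failures and expensive outliers, the cases where you most need to debug, while keeping storage costs manageable for the happy-path majority. Tail-based sampling is usually implemented in the collection pipeline rather than in the storage backend: the OpenTelemetry Collector's tail_sampling processor (also available through Grafana Alloy) buffers the spans of each trace and applies policies like these before forwarding to a backend such as Grafana Tempo. Braintrust and Langfuse expose their own sampling settings; check each platform's documentation for whether sampling is applied before or after a trace completes.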
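The retention policy above can be sketched as a plain function that runs once a trace is complete. This is an illustration of the decision logic only, not the configuration format of any particular collector or platform. TraceSummary and should_retain are hypothetical names, and the summary is assumed to have been aggregated from the spans of the trace beforehand.

```python
import random
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """Per-trace aggregates, assumed to be computed after the trace completes."""
    has_error: bool
    total_tokens: int   # input + output tokens summed over all LLM spans
    cost_usd: float
    latency_s: float


def should_retain(t: TraceSummary, baseline_rate: float = 0.10, rng=random.random) -> bool:
    """Keep every failing or expensive trace; sample the rest uniformly."""
    if t.has_error or t.total_tokens > 50_000 or t.cost_usd > 0.50 or t.latency_s > 30:
        return True
    return rng() < baseline_rate


slow_trace = TraceSummary(has_error=False, total_tokens=8_000, cost_usd=0.04, latency_s=42.0)
print(should_retain(slow_trace))
```

The rng parameter exists only so the random branch can be made deterministic in tests.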

The Observability Platform Landscape

The market has consolidated around three tiers:

Managed, eval-integrated platforms. Braintrust (raised $80M in early 2026) is the leading example: it pairs trace capture with built-in evaluation harnesses and CI/CD quality gates, so failing traces can be converted directly into eval test cases. LangSmith is the strongest option for teams already on the LangChain/LangGraph stack, with deep framework integration.

Open-source / self-hosted. Langfuse and Arize Phoenix are the leading options here. Both are OTel-native, which means traces from any OTel-instrumented stack export to them without vendor-specific adapters. Phoenix is particularly well-suited for teams that want to run the observability backend in their own infrastructure — useful when traces contain sensitive data that can't leave the environment.

Enterprise APM with GenAI extensions. Datadog added native support for OTel GenAI semantic conventions in 2025, and Grafana's stack (Tempo + Loki + Grafana) handles agent traces as a natural extension of existing infrastructure. These options make sense for teams that already have investment in these platforms and don't want to operate a separate AI observability service.

The practical guidance: if you are starting from scratch, Langfuse self-hosted (for data control) or Braintrust (for the eval workflow) are the lowest-friction entry points. Both accept OTel-formatted traces, which means you can switch backends later without re-instrumenting.

A Staged Adoption Path

Teams that have shipped agent observability successfully tend to follow a consistent progression rather than trying to instrument everything at once:

Day 1 — Capture the basics. Instrument LLM calls and tool invocations. Record model, token counts, latency, and errors. This gives you cost visibility and a failure rate baseline within hours.

Week 1 — Add production context. Tag traces with user/session IDs, deployment version, and feature flags. Enable per-trace cost rollups. Start converting high-cost or high-latency traces into investigation targets.

Month 1 — Close the feedback loop. Add online scoring on a sample of traces (e.g., LLM-as-judge for response quality on 5% of traffic). Convert production failures into eval dataset cases so regression testing catches the same failure class before the next deploy.

Quarter 1 — Gate on quality. Enforce quality thresholds in CI. A deploy that regresses the eval pass rate by more than N% is blocked automatically. At this point observability has moved from a debugging tool to a development constraint — the appropriate end state for production systems.
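The final stage can be sketched as a small check that a CI job runs after the eval suite. The function names and the 5-point threshold are hypothetical, and the sketch assumes the regression is measured in percentage points of pass rate; a relative threshold would need a different formula.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of eval cases that passed."""
    return sum(results) / len(results)


def deploy_allowed(baseline: list[bool], candidate: list[bool], max_drop_points: float) -> bool:
    """Allow the deploy unless the pass rate dropped by more than `max_drop_points`."""
    drop = (pass_rate(baseline) - pass_rate(candidate)) * 100
    return drop <= max_drop_points


baseline_run = [True] * 92 + [False] * 8     # 92% pass rate on the last release
candidate_run = [True] * 85 + [False] * 15   # 85% pass rate on the new build

print(deploy_allowed(baseline_run, candidate_run, max_drop_points=5.0))
```

In a real pipeline the CI job would exit non-zero when this returns False, which is what blocks the deploy.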

Implications for Agent Developers

If you are shipping agents into production today and relying on logs and error rates, the gap between what you can see and what is actually happening is larger than it appears. The good news is that the tooling has crossed the threshold from "research project" to "production-ready": OTel auto-instrumentation packages exist for most major providers and frameworks, and the semantic conventions are stable enough to build on.

The highest-leverage starting point is not the most sophisticated observability platform; it is getting OTel traces flowing to any backend, even just local Jaeger, so you can see a real trace of a real agent run. The architecture decisions — which platform, which sampling policy, how to structure eval pipelines — become much clearer once you have seen what your agent actually does.

Three concrete actions:

1. Add opentelemetry-instrumentation-<your-provider> to your agent's dependencies today. Most auto-instrumentation packages need only a few lines of setup and no changes to the agent logic itself.

2. Adopt the gen_ai.* semantic conventions for any custom spans. This keeps your traces interoperable as the tooling ecosystem evolves.

3. Start tracking per-trace cost from day one. Token counts in span attributes make this essentially free to compute, and cost attribution to specific workflow patterns reveals optimization opportunities that would otherwise be invisible.
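A per-trace cost rollup from token counts might look like the sketch below. The price table, the model id, and the function name are hypothetical; real prices vary by provider and change over time, so they would normally come from configuration. Each span is represented by its attribute dictionary.

```python
# Hypothetical prices in USD per million tokens; replace with your provider's rates.
PRICE_PER_MILLION_TOKENS = {
    "example-model": {"input": 3.00, "output": 15.00},
}


def trace_cost_usd(span_attribute_dicts) -> float:
    """Sum the estimated cost of every LLM span in a trace."""
    total = 0.0
    for attrs in span_attribute_dicts:
        price = PRICE_PER_MILLION_TOKENS.get(attrs.get("gen_ai.request.model"))
        if price is None:
            continue  # not an LLM span, or a model with no known price
        total += attrs.get("gen_ai.usage.input_tokens", 0) * price["input"] / 1_000_000
        total += attrs.get("gen_ai.usage.output_tokens", 0) * price["output"] / 1_000_000
    return total


spans = [
    {"gen_ai.request.model": "example-model",
     "gen_ai.usage.input_tokens": 10_000, "gen_ai.usage.output_tokens": 1_000},
    {"gen_ai.request.model": "example-model",
     "gen_ai.usage.input_tokens": 20_000, "gen_ai.usage.output_tokens": 2_000},
    {"gen_ai.tool.name": "web_search"},  # tool span: contributes no token cost
]

print(f"{trace_cost_usd(spans):.3f} USD")
```

Grouping these totals by agent name, workflow, or deployment version is what turns raw token counts into cost attribution.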