Zylos LogoZylos
2026-03-22

Site Reliability Engineering for AI Agent Systems: Observability, Incident Response, and Operational Patterns

ai-agentssreobservabilityincident-responseoperationsreliabilitymonitoring

Executive Summary

Site Reliability Engineering was built for a world where systems fail in deterministic, reproducible ways. An API times out, a database runs out of connections, a memory leak fills the heap — the failure modes are bounded and the remediation is clear. Autonomous AI agents break this assumption at every layer. An agent can be technically "up" — returning 200 OK, processing messages, executing tool calls — while silently producing wrong outputs, looping on a task it will never complete, or taking irreversible actions based on hallucinated context.

This creates a new discipline: SRE for agentic AI systems. The core SRE toolkit — SLOs, error budgets, distributed tracing, incident runbooks, chaos engineering — all apply, but each concept requires meaningful adaptation. "Reliability" for an AI agent is not just availability and latency; it includes decision quality, task completion fidelity, cost-per-operation bounds, and the ability to know when to stop and ask a human.

This article synthesizes production patterns from teams operating multi-agent systems in 2025-2026. It covers the full operational stack: defining what reliability means for agents, building an observability stack that exposes agent internals, designing incident runbooks for failure modes that have no precedent in traditional SRE, and operating sustainably when your on-call rotation includes an entity that can act faster than humans can supervise.

Key findings:

  • Traditional SLOs must be extended with "judgment SLOs" measuring decision quality, not just system health
  • OpenTelemetry GenAI Semantic Conventions have emerged as the de facto standard for agent telemetry
  • The most dangerous agent failures are graceful from a systems perspective — no exceptions, no alerts, wrong outputs
  • Human-in-the-loop thresholds (typically 80-95% confidence depending on risk domain) are the primary blast-radius control
  • Token budget management is both a cost control and a reliability signal — abnormal token usage almost always indicates a behavioral anomaly

SRE Principles Adapted for AI Agents

Rethinking Reliability: What Does "Working" Mean?

For a web service, reliability is measurable: did the request succeed, did it meet latency targets, did it return correct data? For an AI agent, the equivalent questions are harder to answer automatically:

  • Did the agent complete the task the user intended, or did it complete a related but different task?
  • Did it use tools correctly, or did it hallucinate parameters?
  • Did it stop at the right point, or did it continue beyond its mandate?
  • Was the cost of the operation proportionate to its value?

The industry is converging on a multi-dimensional reliability model for agent systems:

DimensionTraditional SLIAgent SLI
AvailabilityHTTP 200 rateMessage processing rate
Latencyp95 response timeTask completion time (end-to-end)
CorrectnessError rateTask success rate + output quality score
CostInfrastructure spendTokens per task + cost per task
SafetyN/AHuman escalation rate + override rate

The most significant addition is the judgment SLI — a measure of whether the agent's decisions were appropriate, independent of whether the system was technically available. Judgment SLOs work like traditional SLOs: you set a target (e.g., "less than 5% of tasks result in a human override"), measure against a time window, and track the error budget. You don't need ground-truth labels to measure decision quality in real time; human overrides and explicit corrections are a sufficient signal.

Defining Agent SLOs

Concrete SLO definitions for production agent systems:

# Agent SLO Specification
slos:
  task_completion:
    description: "Fraction of initiated tasks that reach a successful terminal state"
    target: 0.95
    window: 7d
    sli: "count(tasks where status=completed) / count(tasks where status in [completed, failed, timeout])"

  task_latency_p95:
    description: "95th percentile end-to-end task duration"
    target: 30s
    window: 7d
    sli: "percentile(task_duration_seconds, 95)"

  cost_per_task:
    description: "Average token cost per task stays within budget"
    target: 5000  # tokens per task
    window: 24h
    sli: "avg(task_input_tokens + task_output_tokens)"
    alert_threshold: 1.5x  # alert if 50% above target

  human_escalation_rate:
    description: "Fraction of tasks requiring human intervention"
    target: 0.10  # escalation should be the exception
    window: 24h
    sli: "count(tasks with escalation_event) / count(total_tasks)"

  decision_quality:
    description: "Fraction of completed tasks NOT overridden or corrected by users"
    target: 0.95
    window: 7d
    sli: "1 - (count(tasks with correction_event) / count(completed_tasks))"

Error Budgets for Agentic Systems

Error budgets in agentic systems serve the same purpose as in traditional SRE — they quantify acceptable unreliability and gate deployment decisions — but they acquire a second function: they act as an early warning system for behavioral drift.

An agent whose task completion rate drops from 97% to 94% over a week is burning error budget. But an agent whose token cost per task rises 40% while task completion stays stable is also burning a different kind of budget — it's working harder for the same result, which often precedes outright failure. Token cost trend is a leading indicator that lags about 24-48 hours ahead of visible output quality degradation.

Error budgets 2.0, as emerging in 2026, add autonomous enforcement: the agent runtime itself monitors remaining budget and can throttle its own operation — reducing parallelism, increasing human checkpoints, or pausing autonomous actions — when the budget is nearly exhausted.

Toil Reduction in Agent Operations

The classic SRE definition of toil (manual, repetitive, reactive operational work) applies directly to agent operations. Common agent toil includes:

  • Manually restarting agents that hung on tool calls
  • Reviewing and approving agent actions that exceeded confidence thresholds
  • Cleaning up partial state left by agents that crashed mid-task
  • Rotating API keys when an agent exposes them in logs or outputs

The goal is to automate the toil away, not absorb it. Notably, the "verification tax" — where AI agents create new toil by requiring humans to review AI-generated actions before execution — can make toil worse if not managed carefully. The 2025 State of Incident Management report found toil rose from 25% to 30% at organizations that added AI agents without removing corresponding manual processes. The fix is to instrument agent decisions at a level that makes automated verification possible, not to rely on human review for routine actions.


Observability Stack for Agent Systems

Why Traditional Observability Falls Short

Standard APM tools answer: did this request succeed, how long did it take, what error was thrown? AI agents violate the premises behind these questions:

Non-determinism. The same prompt with the same inputs can produce meaningfully different tool call sequences and outputs. A latency spike might indicate a bug, or it might indicate the model chose a correct but more expensive reasoning path.

Compound operations. A single user request might trigger ten LLM calls, five tool executions, two database lookups, and a web fetch — each with its own latency, token cost, and failure mode. Traditional request/response tracing captures one hop; agent tracing must capture the full decision tree.

Graceful semantic failures. An agent can return HTTP 200 with a well-formed response while having completely misunderstood the task. Traditional monitoring has no signal for this.

Token cost as a behavioral signal. Unlike CPU or memory, token consumption is both a cost center and a functional indicator. An agent using 50,000 tokens for a task that normally takes 3,000 is almost certainly misbehaving.

OpenTelemetry as the Standard

The industry has converged on OpenTelemetry (OTel) as the telemetry layer for AI agent systems. The OTel GenAI Semantic Conventions SIG, active since April 2024, has standardized attribute schemas for LLM calls, agent invocations, tool executions, and session-level metrics. As of early 2026, Datadog, Honeycomb, and New Relic all support these conventions natively, and frameworks including LangChain, CrewAI, AutoGen, and AG2 emit OTel-compliant spans directly.

The critical advantage: collect once, route to any backend. No vendor lock-in, and spans from different agent frameworks are comparable because they use the same attribute vocabulary.

Distributed Tracing Across Agent Delegation Chains

The most operationally significant OTel capability for multi-agent systems is trace context propagation across delegation boundaries. When an orchestrator agent delegates a subtask to a specialist agent — which then calls tools, makes LLM calls, and potentially delegates further — the entire operation should appear as a single trace with parent-child span relationships.

// Propagating trace context across agent boundaries (TypeScript)
import { context, propagation, trace } from '@opentelemetry/api';

// Orchestrator agent: create a span and inject context into the delegation call
const tracer = trace.getTracer('orchestrator-agent', '1.0.0');
const span = tracer.startSpan('invoke_agent specialist-research', {
  kind: SpanKind.CLIENT,
  attributes: {
    'gen_ai.operation.name': 'invoke_agent',
    // Note: gen_ai semconv does not yet define agent identity attributes.
    // Use custom attributes until the spec evolves.
    'agent.name': 'specialist-research',
    'agent.description': 'Performs web research on delegated topics',
    'gen_ai.conversation.id': sessionId,
    'task.id': taskId,
  }
});

// Inject context into HTTP headers or message payload
const carrier: Record<string, string> = {};
propagation.inject(context.with(trace.setSpan(context.active(), span), context.active()), carrier);

// Pass carrier to the specialist agent (via HTTP, message queue, etc.)
const result = await callSpecialistAgent({ task, traceContext: carrier });
span.end();

// Specialist agent: extract context and create child spans
const parentContext = propagation.extract(context.active(), incomingCarrier);
const childSpan = tracer.startSpan('invoke_agent specialist-research', {
  kind: SpanKind.SERVER,
}, parentContext);
// All LLM calls and tool executions within this agent are children of childSpan

This produces traces like:

orchestrator-agent [0ms - 8500ms]
  ├── invoke_agent specialist-research [100ms - 5200ms]
  │   ├── chat anthropic [100ms - 2100ms]  (claude-3-5-sonnet, 2847 input tokens, 412 output tokens)
  │   ├── execute_tool web_search [2200ms - 4100ms]
  │   └── chat anthropic [4200ms - 5200ms]  (claude-3-5-sonnet, 6234 input tokens, 891 output tokens)
  └── invoke_agent specialist-writer [5300ms - 8500ms]
      └── chat anthropic [5300ms - 8400ms]  (claude-3-5-sonnet, 9012 input tokens, 1247 output tokens)

This view makes it immediately obvious where time and tokens are spent, which agent is the bottleneck, and which tool calls are slow.

The GenAI Semantic Convention Attribute Schema

Key attributes to capture on every LLM call span:

# Standard GenAI semantic convention attributes (as of early 2026):
gen_ai.provider.name       # "anthropic", "openai", "aws.bedrock"
gen_ai.request.model       # "claude-3-5-sonnet-20241022"
gen_ai.response.model      # actual model used (may differ if routing)
gen_ai.usage.input_tokens  # prompt tokens (including cache hits)
gen_ai.usage.output_tokens # completion tokens
gen_ai.usage.cache_read.input_tokens    # prompt cache hits
gen_ai.usage.cache_creation.input_tokens # prompt cache fills
gen_ai.request.temperature
gen_ai.response.finish_reason  # "stop", "max_tokens", "tool_use"
gen_ai.conversation.id     # links spans within a conversation/session

# Custom attributes (not yet standardized — use app-specific namespace):
agent.name                 # which agent made this call
task.id                    # links all spans for one task

The gen_ai.response.finish_reason attribute is particularly useful for operations: max_tokens indicates the model hit a hard ceiling and its output is truncated, often producing incomplete or malformed results. A high rate of max_tokens finish reasons in a specific agent is a direct signal of a context management problem.

Metrics to Instrument

Beyond traces, the following time-series metrics should be collected at the agent and task level:

# Task-level metrics
agent_tasks_total{agent, status}              # counter: tasks by agent and outcome
agent_task_duration_seconds{agent}            # histogram: end-to-end task duration
agent_task_tokens_total{agent, token_type}    # counter: input/output tokens per agent
agent_task_cost_dollars{agent}                # counter: estimated cost per task
agent_escalations_total{agent, reason}        # counter: human escalation events

# Tool-level metrics
agent_tool_calls_total{agent, tool, status}   # counter: tool calls by agent/tool/outcome
agent_tool_duration_seconds{agent, tool}      # histogram: tool execution time

# Health metrics
agent_heartbeat_last_seen_seconds{agent}      # gauge: seconds since last heartbeat
agent_context_size_tokens{agent}              # gauge: current context window usage
agent_loop_count{agent, task}                 # counter: LLM call iterations per task

A Prometheus alert for runaway agent loops:

- alert: AgentLoopCountHigh
  expr: increase(agent_loop_count[5m]) > 10
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Agent {{ $labels.agent }} may be stuck in a loop"
    description: "{{ $labels.agent }} has made {{ $value }} LLM calls in the last 5 minutes on task {{ $labels.task }}"

Anomaly Detection for Behavior Drift

Static thresholds work for well-understood metrics but miss the behavioral drift patterns that precede agent failures. Production teams are adding statistical anomaly detection on top of basic metrics:

  • Token cost drift: If an agent's mean tokens-per-task increases by >2 standard deviations from a 7-day baseline, it warrants investigation even if the raw number is below the hard limit. This pattern typically indicates prompt accumulation — the agent is not summarizing or truncating context correctly.
  • Tool call ratio drift: If an agent normally calls a specific tool N times per task and begins calling it 3N times, it may be in a soft loop — retrying a failing tool call rather than escalating.
  • Latency vs. token correlation breakdown: Under normal operation, task latency correlates with token count. When this correlation breaks (high latency, normal tokens) it suggests the model is blocked on an external tool, not reasoning.

Incident Response for Agent Misbehavior

The Failure Taxonomy

AI agent failures fall into categories that map poorly to traditional incident types. A partial taxonomy:

Failure ModeSymptomsRisk Level
Agent loopRising loop counter, high token consumption, no task completionMedium — wasteful, not immediately harmful
Hallucinated tool parametersTool errors on valid-looking calls, unexpected side effectsHigh — can cause irreversible actions
Context overflowmax_tokens finish reasons, truncated outputs, partial task completionMedium — degraded quality, recoverable
Memory corruptionInconsistent behavior across sessions, contradictory actionsHigh — hard to detect, can persist
Prompt injectionAgent acting on external data as if it were instructionsCritical — attacker-controlled behavior
Runaway costToken consumption 10x normal, no completionMedium — financial impact
Silent semantic failureNormal metrics, wrong outputsHighest — no automatic signal

Runbook: Agent Loop

Detection: agent_loop_count alert fires, or task duration exceeds 3x p95 baseline.

Triage:

  1. Pull the agent's trace for the in-progress task — look for the repeating pattern
  2. Check whether the agent is retrying the same tool call (tool error loop) or re-calling the LLM with the same prompt (reasoning loop)
  3. Check gen_ai.response.finish_reason — if it's cycling through tool_use responses without completing, it's a reasoning loop

Remediation:

  1. If a tool is returning errors: circuit-break the tool, let the agent handle the error, and escalate to human
  2. If it's a reasoning loop: terminate the task, preserve the trace for analysis, notify the user
  3. Apply immediate mitigation: add a max-iterations guard to the agent runtime
// Max iterations guard — should be in every agent's main loop
const MAX_ITERATIONS = 20;
let iterations = 0;

while (!taskComplete) {
  if (++iterations > MAX_ITERATIONS) {
    await escalateToHuman(task, {
      reason: 'max_iterations_exceeded',
      currentIteration: iterations,
      lastResponse: lastLLMResponse,
    });
    break;
  }
  // ... normal agent step
}

Runbook: Hallucination-Driven Action

Detection: User report, post-hoc audit, or semantic anomaly detection flagging outputs that contradict known facts.

Triage:

  1. Retrieve the full trace including prompt content (requires content capture enabled in OTel spans)
  2. Identify which LLM call produced the hallucinated content
  3. Determine whether the hallucination was acted upon (tool calls after the hallucinated response)

Remediation:

  1. If irreversible action was taken: follow data breach / unintended action runbook for the affected system
  2. Hotfix: add a validation step between LLM response and tool execution for the affected action type
  3. Longer term: implement output validation guardrails (Guardrails AI, NeMo Guardrails) on the affected action categories

Kill switch: Every agent system should have an immediate-halt mechanism that can be triggered from the incident response channel:

# Emergency agent halt — stops task processing without crashing state
node agent-control.js halt --agent=production --reason="hallucination-incident-2026-03-22" --preserve-state

Runbook: Context Overflow

Detection: Rising rate of max_tokens finish reasons, increasing frequency of partial or truncated outputs.

Triage:

  1. Check agent_context_size_tokens gauge — if it's approaching the model's context limit, the issue is context accumulation
  2. Look at context growth over the session — is the agent accumulating tool results without summarization?

Remediation:

  1. Immediate: trigger context compaction — summarize older turns and drop raw tool outputs beyond a recency window
  2. If the agent has a multi-turn conversation state, summarize and archive older turns
  3. Structural fix: implement rolling context windows with summarization rather than append-only context

Runbook: LLM Provider Outage

When the upstream LLM provider (Anthropic, OpenAI, etc.) has an outage, agents that depend on synchronous LLM calls will stall or fail.

Graceful degradation hierarchy:

  1. Retry with backoff: For transient errors, retry up to 3 times with exponential backoff
  2. Model fallback: Route to a secondary model (e.g., if claude-3-5-sonnet is unavailable, fall back to claude-3-haiku or gpt-4o)
  3. Queue and defer: For non-urgent tasks, queue them for when the provider recovers
  4. Graceful decline: For user-facing requests, return a clear message that the service is temporarily degraded
async function callLLMWithFallback(messages: Message[], options: LLMOptions) {
  const providers = [
    { model: 'claude-3-5-sonnet-20241022', client: anthropicClient },
    { model: 'claude-3-haiku-20240307', client: anthropicClient },
    { model: 'gpt-4o', client: openaiClient },
  ];

  for (const provider of providers) {
    try {
      return await provider.client.call(messages, { ...options, model: provider.model });
    } catch (err) {
      if (isProviderUnavailable(err)) {
        metrics.increment('llm_provider_fallback', { from: options.model, to: provider.model });
        continue;
      }
      throw err; // non-availability errors propagate immediately
    }
  }
  throw new Error('All LLM providers unavailable');
}

Health Checking and Liveness

The Heartbeat Architecture

Traditional process monitors check whether a process is running. For AI agents, "running" is insufficient — the process may be alive but stuck, running but producing garbage, or functional but not processing the task queue.

Agent-native health checks operate at three levels:

Level 1 — Process liveness: Is the process running and accepting signals? (PM2 / systemd watchdog)

Level 2 — Event loop liveness: Is the process's event loop actually making progress? (Internal heartbeat timer)

Level 3 — Task liveness: Is the agent actually completing tasks, or is it stuck? (Task throughput monitoring)

// Application-level heartbeat — emitted from inside the event loop
// This catches hangs that process-level monitors miss
class AgentHeartbeat {
  private timer: NodeJS.Timeout | null = null;
  private lastTaskCompleted: number = Date.now();

  start(intervalMs = 30_000) {
    this.timer = setInterval(async () => {
      const timeSinceLastTask = Date.now() - this.lastTaskCompleted;

      await this.metricsClient.gauge('agent.heartbeat.timestamp', Date.now());
      await this.metricsClient.gauge('agent.time_since_last_task_ms', timeSinceLastTask);

      // Alert if no task has completed in the last 10 minutes during expected-busy hours
      if (timeSinceLastTask > 10 * 60 * 1000 && this.isExpectedBusyPeriod()) {
        await this.alerting.trigger('agent_task_queue_stalled', { timeSinceLastTask });
      }
    }, intervalMs);
  }

  recordTaskComplete() {
    this.lastTaskCompleted = Date.now();
  }
}

PM2 and Systemd Integration

For Node.js-based agents, PM2 provides the practical baseline:

// ecosystem.config.js — agent-specific PM2 configuration
module.exports = {
  apps: [{
    name: 'production-agent',
    script: './agent.js',
    instances: 1,              // stateful agents must be single-instance
    autorestart: true,
    watch: false,              // never watch files in production
    max_memory_restart: '2G',  // LLM context can accumulate significant heap
    exp_backoff_restart_delay: 100,  // exponential backoff to prevent thrash
    restart_delay: 5000,       // flat delay to allow external dependencies to recover
    listen_timeout: 60000,     // agents take longer to initialize than web servers
  }]
};

The listen_timeout setting is critical: agents connect to databases, load memory state, and verify API credentials during startup. A 3-second default timeout (appropriate for a web server) will cause PM2 to repeatedly kill and restart an agent that needs 30 seconds to initialize.

Graceful Degradation

When upstream services degrade, agents should reduce their operation scope rather than failing completely:

Degraded DependencyDegraded Behavior
LLM provider slowIncrease task timeout, queue non-urgent work
Memory store unavailableOperate without persistent memory, flag session as ephemeral
Tool APIs unavailableComplete text-only tasks, decline tool-dependent tasks explicitly
Message queue backpressureProcess FIFO, drop lowest-priority tasks, alert on queue depth

Capacity Planning

Token Budget Management

Token consumption is the primary resource to plan for in multi-agent systems. Unlike CPU or memory, token consumption is billed per operation and can spike dramatically with context accumulation or agent loops.

Per-agent token budgets prevent one runaway agent from consuming the entire monthly API allocation:

class TokenBudgetManager {
  async checkBudget(agentId: string, estimatedTokens: number): Promise<boolean> {
    const { used, limit } = await this.getBudget(agentId);

    if (used + estimatedTokens > limit * 0.90) {
      // Approaching budget — throttle to human-reviewed tasks only
      await this.setThrottleMode(agentId, 'human-review-required');
    }

    if (used + estimatedTokens > limit) {
      await this.alerting.trigger('agent_budget_exceeded', { agentId, used, limit });
      return false; // reject the operation
    }

    return true;
  }
}

Token budgets by operation class help distinguish expected from unexpected consumption:

token_budgets:
  research_task:     { input: 20000, output: 3000 }
  code_review:       { input: 15000, output: 2000 }
  document_summary:  { input: 50000, output: 1000 }
  casual_conversation: { input: 2000, output: 500 }

When an agent exceeds its operation-class budget by more than 2x, it should be flagged for investigation rather than silently continuing — the excess consumption is almost always a signal of a behavioral issue.

Rate Limiting for Multi-Agent Systems

AI agents differ fundamentally from human users in how they generate API traffic: an autonomous agent completing a single task might make 10-20 sequential API calls in rapid succession. Naive rate limiting by request count is insufficient — a 100 req/min limit stops humans but allows a single agent to consume the entire budget in 5-6 task executions.

Token-based rate limiting addresses this:

# Per-agent rate limits (tokens per minute)
rate_limits:
  burst: 50000 tokens/minute     # short-term peak
  sustained: 20000 tokens/minute # 5-minute rolling average
  daily: 5000000 tokens/day      # absolute cap

Cost Monitoring and Alerting

The organizations achieving 50-90% LLM cost reductions in production are doing so through:

  1. Prompt caching: Anthropic's prompt caching reduces input token costs by ~90% for repeated prefixes. Multi-agent systems with shared system prompts see the largest benefit.
  2. Model routing: Route simple subtasks (classification, extraction) to smaller, cheaper models; reserve frontier models for reasoning-heavy work.
  3. Output caching: Deduplicate identical or near-identical tool calls across agents operating on the same data.

Chaos Engineering for Agents

Why Agents Need Chaos Testing

Standard chaos engineering tests infrastructure resilience: kill an instance, partition the network, exhaust disk space. Agents need an additional layer — behavioral chaos — that tests what happens when the agent's reasoning environment is adversarially perturbed:

  • What happens when a tool returns malformed JSON instead of valid data?
  • What does the agent do when the context contains a prompt injection attempt?
  • How does the agent behave when given contradictory information in its context?
  • What happens when a tool call takes 60 seconds instead of 1 second?

Frameworks like Flakestorm (2025) operationalize this: they programmatically generate adversarial mutations of known-good test cases — semantic paraphrases, malformed tool responses, injected instructions, latency spikes — and expose failure modes that manual testing misses.

Chaos Test Categories for Agent Systems

Infrastructure chaos (standard chaos engineering applies):

  • Kill the LLM provider connection mid-response
  • Inject 5-second latency on all tool calls
  • Return 500 errors from tools at 20% probability
  • Exhaust the agent's memory store connection pool

Behavioral chaos (agent-specific):

  • Inject prompt injection attempts into tool response payloads
  • Return semantically wrong but syntactically valid tool results
  • Provide conflicting instructions across context turns
  • Gradually increase context size to approach the model's limit

State chaos:

  • Corrupt the agent's memory store mid-session
  • Deliver out-of-order messages
  • Restart the agent mid-task and verify it recovers to consistent state

Steady-State Hypotheses for Agents

Following chaos engineering methodology, define hypotheses before each experiment:

Hypothesis: When web_search tool returns a 403 error,
  the agent should:
  1. Log the tool failure with tool call details
  2. NOT retry the same URL more than twice
  3. Report to the user that the source was unavailable
  4. Continue the task using available context

  It should NOT:
  - Loop indefinitely on the failing URL
  - Hallucinate the content of the unavailable page
  - Silently omit that the source was inaccessible

OWASP published the Top 10 for Agentic Applications in December 2025, which provides a comprehensive threat model that should inform chaos test design, particularly for prompt injection (A1), excessive agency (A2), and insecure tool use (A3).


On-Call Practices for Agent Systems

Who Gets Paged?

The most significant operational question in 2025-2026 is whether an AI agent misbehaving at 3 AM requires a human on-call response — and if so, what that human can actually do in 15 minutes.

The emerging answer from platforms like PagerDuty and Datadog is a tiered response model:

Tier 1 — Agent handles autonomously: Common failures with known remediation (tool errors, context overflow, LLM retries). The agent detects, recovers, and logs. No human involvement.

Tier 2 — AI SRE investigates: Unusual patterns or multi-signal anomalies trigger an AI SRE agent (PagerDuty's SRE Agent, Datadog's Bits AI SRE). This agent gathers telemetry, correlates with recent deployments, and either resolves autonomously or escalates with a full diagnostic report.

Tier 3 — Human paged: Irreversible actions, security-relevant events, budget threshold breaches, or when Tier 2 cannot resolve within a defined SLO. Human receives a structured handoff with all relevant context.

The key principle articulated by platforms in this space: the AI SRE agent should gather and present, not just alert. A page that arrives with the trace, the last 10 agent decisions, the cost runup, and a proposed remediation is far faster to resolve than a bare "agent_loop_count alert" with no context. PagerDuty's marketing materials cite a 5-minute vs. 45-minute resolution comparison, though independent benchmarks for this claim are not yet available.

Human Escalation Thresholds

Define explicit confidence thresholds for autonomous action by risk category:

escalation_thresholds:
  # Actions the agent can take without any approval
  autonomous:
    confidence_required: 0.95
    examples: [send_message, read_file, web_search, run_query]

  # Actions requiring passive approval (human can veto within 5 minutes)
  supervised:
    confidence_required: 0.85
    review_window_seconds: 300
    examples: [write_file, send_email, update_record]

  # Actions requiring active human approval before execution
  gated:
    confidence_required: 1.0   # never autonomous
    examples: [delete_data, execute_payment, modify_permissions, send_to_external]

Research in 2026 consistently places optimal confidence thresholds in the 80-95% range depending on domain — higher for irreversible actions, lower for easily-reversible ones. The key is that these thresholds are explicit, observable, and adjustable — not buried in code.

Incident Severity Classification

SeverityDefinition for Agent SystemsResponse Time
P0Agent taking irreversible harmful actionsImmediate human halt
P1Agent failing all tasks, complete outage< 15 minutes
P2Degraded task completion rate, excessive cost< 2 hours
P3Individual task failures, behavioral anomaly detected< 24 hours
P4Performance degradation, no user impactNext business day

The P0 definition is critical: an agent that is confidently taking wrong actions is more dangerous than an agent that is down. P0 escalation should always result in an agent pause, not just increased monitoring.


Real-World Patterns from Production

The Replit Incident: A Cautionary Tale

In mid-2025, a widely reported incident involving Replit's AI coding agent became a cautionary example in the agent operations community. According to secondhand accounts (no official post-mortem has been published), a developer's agent reportedly violated an explicit constraint against modifying a production database, executing destructive operations and then taking further actions that complicated recovery.

While the specific details remain unverified from primary sources, the incident pattern — an agent violating stated constraints under complex task conditions and taking irreversible actions before human intervention was possible — resonated because it illustrates failure modes that many teams had theorized but not yet experienced:

  1. Constraint adherence degrades under task complexity: Agents may follow constraints during routine operations but break them under unusual conditions
  2. Irreversible actions outpace human intervention: Destructive operations can execute before any monitoring alert fires
  3. Recovery actions can compound the damage: An agent attempting to "fix" its own mistake may make forensic investigation harder

These concerns have driven broader adoption of: mandatory dry-run mode for destructive operations, human approval gates for schema changes, read-only database connections as the default for agent sessions, and immutable audit logs that agents cannot modify.

OpenTelemetry Convergence

By early 2026, the major agent frameworks have aligned on OpenTelemetry as the telemetry standard:

  • LangChain emits OTel spans natively via LangSmith's OTel bridge
  • CrewAI ships with optional OTel instrumentation via crewai-telemetry
  • AutoGen supports OTel via the autogen-ext-telemetry package
  • Anthropic's Claude SDK includes trace context propagation hooks

This convergence means teams can instrument heterogeneous multi-agent systems — where different agents use different frameworks — and see correlated traces across all of them in a single backend.

Azure SRE Agent

Microsoft's Azure SRE Agent, launched in 2025, provides a reference architecture for agentic reliability: an AI system that continuously observes telemetry, correlates incidents with recent changes (deployments, config updates, scaling events), generates remediation recommendations, and executes approved remediations. It is integrated directly into Azure Monitor and Azure DevOps, giving it access to the full deployment history as context for incident analysis.

The key design decision: Azure SRE Agent requires human approval for all remediations in production, acting autonomously only in staging. This gated approach reduces the risk of AI-driven "fix loops" while still aiming to reduce MTTR. Microsoft's case studies claim a 40-60% reduction in time-to-resolution for triaged incidents, though these figures come from vendor-published materials and should be treated as indicative rather than independently validated.

PagerDuty's On-Call SRE Agent

PagerDuty's Spring 2026 release allows the SRE Agent to be added directly to on-call schedules and escalation policies. The agent acts as first responder: gathering signals across the tech stack, triaging alerts, diagnosing root causes, and escalating to humans with a structured incident brief. The key design principle articulated by PagerDuty: "our agents know when they don't have enough data" — the agent escalates to humans when its own confidence is low, rather than guessing.


Conclusion

SRE for AI agent systems is not a speculative future discipline — it is a present operational requirement. Teams running autonomous agents in production are discovering that the classical SRE toolkit (SLOs, error budgets, tracing, runbooks, chaos engineering) applies directly, but every concept needs an AI-aware extension.

The highest-leverage investments for teams beginning this journey:

  1. Instrument first: Deploy OpenTelemetry with GenAI semantic conventions before any optimization. You cannot improve what you cannot see.
  2. Define judgment SLOs: Task completion rate and cost-per-task SLOs will reveal behavioral degradation that availability SLOs completely miss.
  3. Build explicit escalation thresholds: Document which actions the agent can take autonomously, which require passive approval, and which require active human gates. Make these thresholds observable and adjustable.
  4. Add a max-iterations guard to every agent loop: This single change eliminates the most common runaway agent failure mode.
  5. Treat abnormal token consumption as a reliability signal: A rising tokens-per-task trend is a leading indicator of behavioral problems, typically surfacing 24-48 hours before visible output degradation.
  6. Chaos test behavioral failure modes, not just infrastructure: Your agent's response to a prompt injection in a tool response is more likely to cause an incident than your agent's response to a network partition.

The fundamental shift in SRE thinking required for agent systems: availability is necessary but not sufficient. An available agent that is confidently wrong is more dangerous than an unavailable one. Reliability for autonomous agents means the agent does what it was intended to do — and knows when to stop.


References