Zylos
2026-03-01

Hierarchical AI Agent Coordination: Task Delegation, Review Loops, and Trust Boundaries

ai-agents · multi-agent · coordination · delegation · code-review · trust-boundaries

Executive Summary

The AI industry has crossed an inflection point: the dominant architectural pattern for complex AI work is no longer a single capable model but a hierarchy of cooperating agents. A manager agent decomposes goals, delegates subtasks to specialist workers, reviews their outputs, and iterates until quality criteria are satisfied. This mirrors how high-performing human teams operate — and it scales in ways that monolithic single-agent approaches cannot.

Key findings from current research and production deployments:

  • Hierarchical multi-agent systems reduce coordination complexity from O(n²) to O(n) by introducing intermediate management layers
  • Iterative review loops catch 3-5x more defects than single-pass review, but each additional iteration yields diminishing returns after round 3-4
  • Security degradation is a documented failure mode: code subjected to five or more AI improvement iterations shows a 37.6% increase in critical vulnerabilities
  • The global agentic AI market is growing from $7.06 billion in 2025 to a projected $93.2 billion by 2032 (44.6% CAGR)
  • Only ~5% of enterprises had AI agents in production at the start of 2025; Gartner projects 40% of enterprise applications will feature task-specific agents by 2026
  • Two new protocols — Anthropic's MCP (vertical tool access) and Google's A2A (horizontal agent coordination) — are becoming the interoperability backbone of multi-agent systems

This article examines the architecture, protocols, quality loops, trust models, failure modes, and observability patterns that distinguish robust hierarchical agent systems from brittle ones.


The Shift from Single-Agent to Multi-Agent Systems

Why Monolithic Agents Hit a Ceiling

The first wave of AI agents treated the language model as a single omniscient entity: give it all the context, ask it to complete a complex task, and receive the answer. This works for bounded tasks but breaks down under three pressures:

Context overflow: Complex real-world tasks — write a full-stack feature, conduct competitive analysis, audit an entire codebase — exceed the practical context window of any single model call. Stuffing everything into one prompt degrades coherence and introduces conflation errors.

Specialization gaps: A general-purpose agent asked to simultaneously reason about database schemas, write idiomatic React, and draft a compliance document will perform worse on each than a specialist agent focused on one domain. Models fine-tuned or prompted for specific roles outperform generalists on those roles.

Parallelism constraints: Sequential single-agent execution means every subtask blocks the next. A multi-agent system can fan out: three analysts, one writer, and one reviewer working concurrently complete the same work in a fraction of the time.

The Team Analogy

The shift to multi-agent architectures mirrors how high-performing human organizations work. A project manager does not execute all tasks personally — they decompose the project, assign work to specialists, review deliverables, and iterate. The AI equivalent is a manager agent that:

  1. Receives a high-level goal
  2. Decomposes it into a dependency graph of subtasks
  3. Assigns each subtask to the most capable available specialist agent
  4. Monitors progress and handles exceptions
  5. Aggregates outputs into a coherent whole
  6. Validates the result against the original specification

This is now the standard pattern recommended by every major framework — CrewAI, LangGraph, AutoGen, Google ADK, and OpenAI's Agents SDK — as well as production platforms at AWS, Microsoft Azure, and Google Cloud.


Hierarchical Coordination Architectures

Manager-Worker Pattern

The simplest hierarchical form is a two-tier manager-worker system. A single manager agent receives the user request and breaks it into subtasks. Worker agents execute those subtasks and return results. The manager synthesizes the results and delivers the final output.

User Request
     │
     ▼
┌────────────┐
│  Manager   │  ← Decomposes, delegates, synthesizes
│   Agent    │
└────────────┘
   │        │         │
   ▼        ▼         ▼
┌──────┐ ┌──────┐ ┌───────┐
│ Web  │ │ Code │ │ Data  │
│Search│ │Writer│ │Analyst│
└──────┘ └──────┘ └───────┘

This pattern works well when subtasks are independent (can be parallelized) and when the manager can evaluate worker outputs directly.

Strengths: Simple, observable, easy to debug. Clear ownership of each subtask.

Weaknesses: Manager becomes a bottleneck. If worker outputs require sequential dependency, the pattern loses parallelism advantages. Single manager is a single point of failure.
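The decompose/delegate/synthesize flow can be sketched in a few lines of asyncio, with hypothetical stub workers standing in for LLM-backed agents (all function names here are illustrative):

```python
import asyncio

# Hypothetical stand-ins for LLM-backed specialist workers
async def web_search_worker(subtask: str) -> str:
    return f"search results for: {subtask}"

async def code_writer_worker(subtask: str) -> str:
    return f"code for: {subtask}"

async def data_analyst_worker(subtask: str) -> str:
    return f"analysis of: {subtask}"

async def manager(request: str) -> str:
    # Decompose: a real manager would use an LLM; here the split is fixed
    assignments = [
        (web_search_worker, f"background on {request}"),
        (code_writer_worker, f"prototype for {request}"),
        (data_analyst_worker, f"metrics for {request}"),
    ]
    # Delegate: independent subtasks run concurrently
    results = await asyncio.gather(*(w(t) for w, t in assignments))
    # Synthesize: aggregate worker outputs into one deliverable
    return "\n".join(results)

print(asyncio.run(manager("rate limiter")))
```

The key property is in the gather call: because the subtasks share no dependencies, total latency is the slowest worker, not the sum of all three.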

Supervisor with Sub-Orchestrators

For larger systems, intermediate layers of supervision become necessary. A top-level orchestrator delegates not to individual workers but to sub-orchestrators, each of which manages its own team of specialists.

     Top-Level Orchestrator
                │
        ┌───────┴───────┐
        ▼               ▼
    Research       Engineering
   Supervisor       Supervisor
        │               │
    ┌───┴───┐       ┌───┴───┐
    ▼       ▼       ▼       ▼
   Web    Data    Code    Test
  Agent  Analyst  Writer  Runner

This pattern reduces cognitive load on the top-level agent (it only manages N supervisors, not N*M workers) and enables domain-specific expertise at each management layer. Coordination complexity drops from O(n²) for flat architectures to roughly O(n) per layer.

Microsoft's multi-agent reference architecture, AWS's agentic system documentation, and Google's ADK all recommend this pattern for systems with more than 5-6 agents.

Orchestrator-Executor Model

A variant used heavily in workflow automation distinguishes between orchestrators (which decide what to do) and executors (which decide how to do it). The orchestrator maintains the high-level plan and task graph. Executors are specialized for specific execution contexts — code execution, API calls, browser automation, database queries — and report results back.

This pattern is prominent in Azure AI Foundry, where the orchestrator layer handles planning and the executor layer handles tool invocation with specific capability scopes.

Swarm / Peer Delegation

Not all multi-agent architectures are strictly hierarchical. OpenAI's Swarm framework and its successor, the Agents SDK, implement peer delegation: any agent can hand off to any other agent based on capability matching. There is no fixed hierarchy — the routing is dynamic and driven by the agent's self-assessment of whether it should handle a task or transfer it.

# Swarm-style handoff: a tool function that returns an Agent triggers transfer
def transfer_to_billing_agent():
    """Transfer to billing specialist when payment questions arise."""
    return billing_agent  # Returning an agent object triggers the handoff

# In the Agents SDK, handoffs take the target agents directly
triage_agent = Agent(
    name="Triage",
    instructions="Route requests to the right specialist.",
    handoffs=[billing_agent, technical_agent]
)

The Swarm/Agents SDK approach is lightweight but trades observability for flexibility. Without a clear hierarchy, debugging failures in a peer delegation network is harder — responsibility is diffuse.


Task Delegation Protocols

Decomposition Strategies

Effective task delegation begins with decomposition. How a manager agent breaks down a complex goal directly determines whether the resulting subtasks are actionable, well-scoped, and parallelizable.

Three decomposition strategies dominate current implementations:

Goal-based decomposition: Start with the end state and work backwards. "Deploy a microservice" decomposes into: write code, write tests, write Dockerfile, set up CI config, write deployment manifest, write documentation. Each piece is independently specifiable.

Skill-based decomposition: Match task fragments to available agent capabilities. If you have a web search agent, a code writing agent, and a data analysis agent, decompose the task along those capability boundaries rather than along logical structure.

Dependency-graph decomposition: Build a DAG of subtasks where edges represent data dependencies. Nodes with no incoming edges can be parallelized immediately. Nodes with incoming edges wait for their dependencies. This is the most sophisticated approach and is implemented explicitly in frameworks like LangGraph and Google ADK.
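As a sketch, dependency-graph decomposition reduces to repeatedly collecting the nodes whose dependencies are all satisfied; each such "wave" can execute in parallel. The task names below are illustrative:

```python
def schedule_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group subtasks into parallel waves: a task is ready as soon as
    all of its dependencies completed in an earlier wave."""
    remaining = {task: set(d) for task, d in deps.items()}
    done: set[str] = set()
    waves = []
    while remaining:
        ready = sorted(t for t, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("cycle detected in task graph")
        waves.append(ready)
        done.update(ready)
        for t in ready:
            del remaining[t]
    return waves

# "Deploy a microservice" from the goal-based example, as a DAG
graph = {
    "write_code": set(),
    "write_tests": {"write_code"},
    "dockerfile": {"write_code"},
    "docs": {"write_code"},
    "ci_config": {"write_tests"},
    "deploy_manifest": {"dockerfile"},
}
print(schedule_waves(graph))
# → [['write_code'], ['dockerfile', 'docs', 'write_tests'], ['ci_config', 'deploy_manifest']]
```

Frameworks like LangGraph and Google ADK maintain this graph explicitly; the sketch shows why the DAG form matters: the second wave runs three subtasks concurrently.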

Context Handoff Formats

When a manager delegates a subtask, it must package enough context for the worker to act without requiring back-and-forth clarification. Poorly formatted context handoffs are one of the primary causes of quality degradation in multi-agent pipelines.

The Model Context Protocol (MCP), launched by Anthropic in November 2024, has emerged as a standardized way to package and transfer context. MCP defines resource formats for passing task-specific context directly from one agent to another as part of delegation, ensuring the receiving agent inherits not just task parameters but relevant computational state.

A well-structured delegation message typically contains:

{
  "task_id": "task-abc-123",
  "objective": "Write a Python function that validates email addresses using RFC 5322",
  "constraints": [
    "Must handle international domains",
    "Must return (bool, str) — valid flag and error message",
    "Must include docstring with examples"
  ],
  "context": {
    "codebase_language": "Python 3.11",
    "style_guide": "PEP 8 with 88-char line limit (Black formatter)",
    "existing_validator_interface": "...",
    "test_framework": "pytest"
  },
  "verification_spec": {
    "method": "LLM_JUDGE",
    "criteria": "Code handles all RFC 5322 edge cases, docstring is complete, function matches interface"
  },
  "priority": "high",
  "reversibility": "high",
  "max_iterations": 3
}

The delegato framework formalizes this structure, treating tasks as atomic units of work carrying a goal, required capabilities, verification spec, priority, complexity estimate, and reversibility level. The verification spec supports five methods: LLM_JUDGE (subjective quality assessment), REGEX (pattern matching), SCHEMA (JSON structure validation), FUNCTION (custom logic), and NONE (trust the worker).
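A dispatcher over those five methods might look like the following sketch. The spec keys pattern, required_keys, and fn, and the llm_judge call, are assumptions for illustration, not part of the delegato format:

```python
import json
import re

def verify(output: str, spec: dict) -> bool:
    """Dispatch on the verification method named in a task's
    verification_spec. LLM_JUDGE is stubbed; in practice it calls
    a judge model with spec['criteria']."""
    method = spec.get("method", "NONE")
    if method == "NONE":
        return True                      # Trust the worker
    if method == "REGEX":
        return re.search(spec["pattern"], output) is not None
    if method == "SCHEMA":
        try:                             # Structure check: required keys present
            data = json.loads(output)
        except json.JSONDecodeError:
            return False
        return all(k in data for k in spec["required_keys"])
    if method == "FUNCTION":
        return spec["fn"](output)        # Custom logic supplied by the caller
    if method == "LLM_JUDGE":
        return llm_judge(output, spec["criteria"])  # Assumed judge call
    raise ValueError(f"unknown verification method: {method}")

assert verify('{"valid": true}', {"method": "SCHEMA", "required_keys": ["valid"]})
assert verify("def f(): pass", {"method": "REGEX", "pattern": r"def \w+"})
```

The useful property of the dispatch is that the manager can choose cheap mechanical checks (REGEX, SCHEMA) for structured outputs and reserve LLM judgment for genuinely subjective quality criteria.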

Google A2A vs. Anthropic MCP

Two protocols are becoming the de facto standards for agent interoperability in 2025-2026:

Anthropic's Model Context Protocol (MCP) — a vertical protocol: it standardizes how a single agent connects to tools, databases, and external services. Think of it as a plugin system. MCP defines how an agent requests resources, receives results, and manages tool invocations.

Google's Agent-to-Agent (A2A) — a horizontal protocol: it standardizes how peer agents communicate with each other — task delegation, capability discovery, status reporting, handoffs. Announced April 9, 2025, with backing from 50+ major tech companies.

The relationship is complementary: MCP handles what an individual agent can do (its tool belt), while A2A handles how agents coordinate with each other (the communication layer). A production multi-agent system uses both: A2A for inter-agent coordination and MCP for each agent's tool access.


Review and Quality Convergence Loops

The Iterative Review Pattern

Static, single-pass review — submit work, receive feedback once, done — has a fixed detection ceiling. Every model has systematic blind spots based on its training distribution. Multi-round iterative review breaks through this ceiling by exploiting the fact that different review passes surface different defect classes.

The canonical iterative review loop:

┌─────────────────────────────────────────────┐
│  Worker generates output (code, document,   │
│  analysis, etc.)                            │
│              ↓                              │
│  Reviewer examines full output (not just    │
│  diff) against specification                │
│              ↓                              │
│  Reviewer outputs: list of confirmed issues │
│  with severity and suggested fixes          │
│              ↓                              │
│  Worker revises based on feedback           │
│              ↓                              │
│  Repeat until: zero confirmed issues OR     │
│  max iterations reached OR quality score    │
│  exceeds threshold                          │
└─────────────────────────────────────────────┘

Research benchmarked at FSE 2025 shows agentic iterative approaches outperform single-shot attempts by 21% on complex coding tasks. The quality gain is concentrated in the first 2-3 rounds: round 1 catches the most defects, round 2 catches issues introduced by round 1 fixes, and round 3 provides a clean sweep. Rounds 4+ show sharply diminishing returns.

Google ADK Loop Agents

Google's Agent Development Kit provides a first-class LoopAgent primitive for exactly this pattern. The LoopAgent executes its sub-agents in a cycle until a termination condition fires:

from google.adk.agents import LoopAgent, LlmAgent

code_writer = LlmAgent(
    name="CodeWriter",
    model="gemini-2.0-flash",
    instruction="Write the requested code. If reviewer feedback exists, apply all corrections.",
    output_key="current_code"
)

code_reviewer = LlmAgent(
    name="CodeReviewer",
    model="gemini-2.0-flash",
    instruction="""Review the code in {current_code}.
    If all issues are resolved, respond with 'APPROVED'.
    Otherwise list each issue with severity and fix suggestion.""",
    output_key="review_result"
)

# Termination check: a tool sets escalate=True to end the enclosing loop
def exit_loop(tool_context):
    """Signal the enclosing LoopAgent to stop by escalating."""
    tool_context.actions.escalate = True
    return {}

exit_checker = LlmAgent(
    name="ExitChecker",
    model="gemini-2.0-flash",
    instruction="If {review_result} contains 'APPROVED', call the exit_loop tool.",
    tools=[exit_loop],
)

review_loop = LoopAgent(
    name="IterativeCodeReview",
    sub_agents=[code_writer, code_reviewer, exit_checker],
    max_iterations=5  # Hard cap on iterations
)

The critical design point: the LoopAgent itself does not decide when to stop. A sub-agent must signal escalate=True via EventActions when the quality threshold is met. Without this termination signal, the loop runs to max_iterations — a necessary safety valve.

Convergence Patterns and Stopping Criteria

Determining when to stop is one of the hardest problems in iterative review. Common strategies:

Fixed iteration count: Run exactly N review rounds. Simple but wastes compute if quality is achieved early, or misses remaining issues if N is too small.

Quality score threshold: A meta-evaluator scores the output after each round. Stop when the score exceeds a threshold (e.g., 0.95 on a 0-1 scale). Requires a reliable scoring function.

Zero-defect convergence: Run until the reviewer reports no confirmed issues. The "confirmed" qualifier is critical — reviewers often report suspected issues that are actually false positives. The reviewer must be prompted to distinguish confirmed defects from speculative concerns.

Diminishing returns detection: Track the number of new issues found per round. Stop when the round-over-round decline in new findings exceeds a threshold (e.g., fewer than 2 new issues in the latest round).

Time/cost budget: Stop when elapsed time or token cost exceeds a budget. Practical for production systems with SLA requirements.

Production implementations typically combine these: a quality threshold as the primary condition, a max iterations cap as a safety valve, and a cost budget as an override.
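A combined checker along those lines can be sketched as follows; the default values are illustrative, not recommendations:

```python
def should_stop(round_num, issues_by_round, quality_score, tokens_used,
                max_rounds=4, quality_threshold=0.95, token_budget=200_000,
                min_new_issues=2):
    """Combine the stopping strategies: quality threshold as the primary
    condition, iteration cap and cost budget as safety valves, and a
    diminishing-returns check on new findings per round."""
    if quality_score >= quality_threshold:
        return True, "quality threshold met"
    if round_num >= max_rounds:
        return True, "max iterations reached"
    if tokens_used >= token_budget:
        return True, "token budget exhausted"
    if len(issues_by_round) >= 2 and len(issues_by_round[-1]) < min_new_issues:
        return True, "diminishing returns"
    return False, "continue"

stop, reason = should_stop(2, [["a", "b", "c"], ["d"]], 0.8, 40_000)
# Only one new issue in the latest round, so the loop stops early
```

Ordering matters: the hard caps are checked before the heuristic, so a runaway loop terminates even if the diminishing-returns signal never fires.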

The Security Degradation Paradox

A critical finding from IEEE-ISTAS 2025 research deserves prominent attention. Researchers analyzed 400 code samples subjected to iterative AI improvement across 40 rounds. They found a 37.6% increase in critical vulnerabilities after just five iterations.

The mechanism: when an AI iteratively "improves" code, it may refactor structures in ways that introduce subtle security issues — insecure defaults, missing input validation, changed trust assumptions — even when explicitly prompted to improve security. This is the security degradation paradox: iterative improvement optimizes for the most recent feedback, potentially degrading properties not explicitly checked in that round.

Practical countermeasures:

  • Run a dedicated security review agent as a separate pass, distinct from functional review
  • Use SAST tooling (not LLM judgment) for security checks between rounds
  • Cap functional iteration loops at 3-4 rounds and follow with a security-focused final pass
  • Maintain a security baseline diff: compare final output to original for security-relevant constructs

Trust Boundaries and Permission Scoping

The Delegation Trust Pyramid

Not everything should be delegated. The central challenge in multi-agent systems is determining what a worker agent is authorized to do without human confirmation. This determines both safety and utility — over-restriction creates a system that constantly asks for permission, while under-restriction creates uncontrolled autonomous action.

AWS's Agentic AI Security Scoping Matrix provides a useful framework. Actions are classified along two axes:

Impact axis: How severe is the consequence if the agent makes a mistake?

  • Low: Read-only operations, generating drafts, analysis
  • Medium: Write operations to reversible stores, sending messages
  • High: Irreversible writes, financial transactions, infrastructure changes

Reversibility axis: Can the action be undone?

  • High reversibility: Version-controlled changes, staged deployments, sandbox environments
  • Low reversibility: Database deletions, sent emails, executed payments, deployed infrastructure

Actions in the high-impact, low-reversibility quadrant require human approval before execution. Actions in the low-impact, high-reversibility quadrant can be fully automated.
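That two-axis classification can be encoded as a small policy function. The three outcome labels here are illustrative, not part of the AWS matrix itself:

```python
def approval_policy(impact: str, reversibility: str) -> str:
    """Map the impact/reversibility axes to an execution policy."""
    if impact == "high" and reversibility == "low":
        return "human_approval"   # Irreversible and severe: a person signs off
    if impact == "low" and reversibility == "high":
        return "auto"             # Safe quadrant: fully automated
    return "policy_check"         # Middle quadrants: automated policy gate

assert approval_policy("high", "low") == "human_approval"
assert approval_policy("low", "high") == "auto"
assert approval_policy("medium", "high") == "policy_check"
```

Keeping the mapping in one function makes the trust boundary auditable: the answer to "can this agent do X unattended?" lives in policy code, not scattered across prompts.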

Zero-Trust Tool Access

The principle of least privilege applies strongly to agent tool access. Best practices from production deployments:

Short-lived tokens: Provide agents with service tokens that expire after the task, not persistent credentials. This limits blast radius if an agent is compromised or misbehaves.

Capability scoping: An agent that needs to update CRM records should not have access to financial systems or HR data. Scope permissions to the minimum set required for the assigned task, not the agent's maximum theoretical need.

Just-in-time elevation: For high-privilege operations, agents request elevated access at the moment of need rather than holding it throughout their lifecycle. The request triggers a human approval workflow or an automated policy check.

Scope per-agent, not per-system: Different worker agents can have different permission sets even within the same pipeline. A research agent gets read-only web access. A code writing agent gets write access to a sandbox environment. A deployment agent gets write access to staging but not production.

Escalation Patterns

When an agent encounters a situation that exceeds its authorization scope, it must have a well-defined escalation path. Common patterns:

Synchronous escalation: The agent pauses, notifies a human or supervisor agent, and waits for approval before proceeding. This maintains sequential correctness but introduces latency.

Async escalation with checkpoint: The agent completes all work it can within its authorization scope, checkpoints its state, and sends an escalation request. Work resumes when approval arrives. Reduces latency for non-blocking portions.

Scope expansion request: The agent requests a temporary permission expansion from a supervisor agent, which can grant or deny it based on policy rules. This keeps humans out of routine approval flows while maintaining oversight for exceptional cases.

Graceful degradation: If escalation cannot be completed within a time budget, the agent delivers partial results with a clear indication of what was not completed due to authorization constraints.
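A sketch of the time-budgeted variant, using a queue to stand in for the human or supervisor approval channel (all names and the 2-second default are illustrative):

```python
import queue

def escalate_with_budget(action: str, approvals: queue.Queue,
                         timeout_s: float = 2.0) -> dict:
    """Wait for an approval decision up to a time budget; degrade
    gracefully to a partial result when none arrives in time."""
    try:
        approved = approvals.get(timeout=timeout_s)
    except queue.Empty:
        return {"status": "partial",
                "note": f"{action!r} skipped: approval timed out"}
    if approved:
        return {"status": "complete", "note": f"{action!r} executed"}
    return {"status": "denied", "note": f"{action!r} rejected by supervisor"}

channel = queue.Queue()
channel.put(True)                       # Supervisor pre-approves one action
print(escalate_with_budget("restart staging", channel))
print(escalate_with_budget("drop prod table", channel, timeout_s=0.1))
```

The second call finds no decision within its budget and returns a partial result, which is exactly the graceful-degradation behavior described above.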

Human-in-the-Loop Checkpoints

Production systems that handle high-stakes actions (financial operations, infrastructure changes, content publishing) implement explicit human-in-the-loop checkpoints. The agent pipeline pauses at designated checkpoint nodes and surfaces a summary of planned actions for human review.

The design of these checkpoints is critical. If agents create too many checkpoints or present them in confusing formats, humans will reflexively approve without reading — a failure mode sometimes called "approval fatigue" or "authorization habituation." Best practices:

  • Aggregate related approvals into a single decision surface
  • Present diffs and previews, not just text descriptions
  • Reserve checkpoints for genuinely consequential actions
  • Automate routine approvals that meet predefined policy criteria

Framework Implementations

CrewAI: Role-Based Hierarchical Teams

CrewAI implements a hierarchical process where a manager agent coordinates a crew of specialist agents. The manager can be auto-generated by the framework or explicitly configured by the developer.

from crewai import Agent, Crew, Task, Process

# Specialist agents — delegation disabled, tools enabled
researcher = Agent(
    role="Research Analyst",
    goal="Find accurate, current information on assigned topics",
    backstory="Expert researcher with access to web search tools",
    tools=[web_search_tool, arxiv_tool],
    allow_delegation=False  # Specialists don't re-delegate
)

writer = Agent(
    role="Technical Writer",
    goal="Produce clear, well-structured technical documents",
    backstory="Writer specializing in developer-facing documentation",
    tools=[],
    allow_delegation=False
)

reviewer = Agent(
    role="Quality Reviewer",
    goal="Ensure all outputs meet quality and accuracy standards",
    backstory="Meticulous editor focused on accuracy and consistency",
    tools=[],
    allow_delegation=False
)

# Manager agent with explicit LLM and delegation enabled
manager = Agent(
    role="Project Manager",
    goal="Coordinate the team to deliver high-quality research reports",
    backstory="Experienced coordinator of specialist technical teams",
    allow_delegation=True,
    llm="gpt-4o"  # Manager uses a more capable model
)

# Crew with hierarchical process
crew = Crew(
    agents=[researcher, writer, reviewer],
    manager_agent=manager,
    process=Process.hierarchical,
    verbose=True
)

Key evolution in 2025: CrewAI introduced the allowed_agents parameter (PR #2068), giving managers fine-grained control over which agents can be delegated to in a given context, enabling conditional delegation trees rather than flat "delegate to anyone" patterns.

A documented bug as of April 2025 (Issue #2606): the DelegateWorkToolSchema fails when the manager passes dictionary objects for task and context parameters where the schema expects strings — a type validation error that surfaces in complex hierarchical workflows.

LangGraph: State-Machine Supervision

LangGraph models multi-agent workflows as directed graphs where nodes are agents and edges are control flow. The supervisor pattern uses a special supervisor node that examines the current state and decides which agent to invoke next.

from langchain_core.messages import SystemMessage
from langgraph.graph import StateGraph, END
from langgraph.prebuilt import create_react_agent

# Create specialist agents
researcher = create_react_agent(llm, tools=[web_search, arxiv])
coder = create_react_agent(llm, tools=[code_executor, file_system])
reviewer = create_react_agent(llm, tools=[static_analyzer])

# Supervisor decides routing
def supervisor(state):
    """Route to the next agent based on task state."""
    messages = [SystemMessage(
        content=system_prompt + "\nRouting options: researcher, coder, reviewer, FINISH"
    )] + state["messages"]
    response = supervisor_llm.invoke(messages)
    return {"next": response.content.strip()}  # Model replies with an agent name or FINISH
# Build the graph
workflow = StateGraph(AgentState)
workflow.add_node("supervisor", supervisor)
workflow.add_node("researcher", researcher)
workflow.add_node("coder", coder)
workflow.add_node("reviewer", reviewer)

# Supervisor routes to agents; agents return to supervisor
for agent in ["researcher", "coder", "reviewer"]:
    workflow.add_edge(agent, "supervisor")

workflow.set_entry_point("supervisor")  # Every run starts at the supervisor
workflow.add_conditional_edges(
    "supervisor",
    lambda x: x["next"],
    {"researcher": "researcher", "coder": "coder",
     "reviewer": "reviewer", "FINISH": END}
)

graph = workflow.compile()

LangGraph's strength is explicit state management: every agent interaction modifies a shared state object, making it possible to inspect exactly what happened at every step. This is critical for debugging complex multi-agent interactions and for implementing checkpointing and rollback.

The LangChain team now recommends using the supervisor pattern directly via tool calls rather than importing the langgraph-supervisor library — the tool-calling approach provides more control over context engineering.

AutoGen / AG2: Conversational Multi-Agent

AutoGen (now maintained as AG2 by the open-source community after Microsoft restructured its agentic offerings) implements multi-agent coordination through conversational patterns. Agents exchange messages in group chats or nested conversations, with a GroupChatManager controlling turn order.

AutoGen 0.4 (January 2025) was a complete redesign focused on production robustness. The Microsoft Agent Framework (released as public preview October 2025) merges AutoGen's multi-agent orchestration with Semantic Kernel's production foundations.

AutoGen's distinctive feature is nested conversations: an agent can initiate a sub-conversation with another agent to resolve a subtask, then return the result to the parent conversation. This enables recursive delegation without explicit hierarchy definition.

Key design patterns in AutoGen:

  • Critic-executor pattern: One agent proposes solutions, another critiques them, loop until critic approves
  • Sequential chain: Agents pass work product sequentially, each adding or transforming it
  • Parallel fan-out: Manager spawns multiple concurrent conversations, aggregates results

OpenAI Agents SDK: Routines and Handoffs

OpenAI's Agents SDK (the production successor to the experimental Swarm framework) centers on two primitives:

Routines: Sequences of instructions an agent follows to complete a task. A routine is essentially a structured prompt with embedded decision logic.

Handoffs: Mechanisms by which one agent transfers control to another. Implemented via tool calls — an agent's tool list includes functions that return other agent objects, triggering a context switch.

from agents import Agent, handoff  # Provided by the openai-agents package

billing_agent = Agent(
    name="BillingSpecialist",
    instructions="Handle all billing inquiries including refunds, payment issues, and plan changes."
)

technical_agent = Agent(
    name="TechnicalSupport",
    instructions="Diagnose and resolve technical issues. Escalate hardware failures to human support."
)

triage_agent = Agent(
    name="TriageAgent",
    instructions="""Classify the user's issue and route to the appropriate specialist.
    - Billing questions → billing_specialist
    - Technical problems → technical_support
    - Everything else → handle directly""",
    handoffs=[
        handoff(billing_agent),
        handoff(technical_agent)
    ]
)

The Agents SDK is stateless between calls by design — no persistent state is maintained across the run() function boundary. This simplifies reasoning about agent behavior but requires callers to manage conversation history explicitly.

Google ADK: Workflow Primitives

Google's Agent Development Kit provides the most explicit workflow abstractions of any current framework:

  • SequentialAgent: Runs sub-agents in order, passing outputs as inputs
  • ParallelAgent: Runs sub-agents concurrently, merges outputs
  • LoopAgent: Runs sub-agents in a cycle until termination signal or max iterations

These workflow primitives compose: a LoopAgent containing a ParallelAgent enables iterative refinement where each iteration fans out multiple specialized evaluators.

ADK agents communicate through a shared session context, with explicit input_key / output_key parameters controlling which fields each agent reads from and writes to. This prevents unintended data coupling between agents that share context.


Failure Modes and Anti-Patterns

Delegation Loops

The most insidious failure mode: Agent A delegates to Agent B, which delegates back to Agent A (or to Agent C which delegates to Agent A), creating an infinite cycle. Without cycle detection and termination rules, these loops exhaust token budgets and time limits while producing no useful output.

Causes:

  • Ambiguous task boundaries allow agents to claim a task is "out of scope" and re-delegate
  • Manager agents without clear stopping conditions keep re-assigning work they perceive as incomplete
  • Peer delegation networks (like Swarm) without cycle prevention

Prevention:

  • Assign each task a unique ID; reject delegation attempts for tasks already in the call stack
  • Implement a maximum delegation depth (default: 3 hops in most frameworks)
  • Distinguish "delegate" from "request information from" — only the former transfers task ownership
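The first two prevention rules fit in a small guard; this is a sketch of the idea, not any particular framework's implementation:

```python
MAX_DEPTH = 3  # Maximum delegation hops, per the prevention rules above

def delegate(task_id: str, call_stack: list[str]) -> list[str]:
    """Guarded delegation: reject tasks already on the call stack
    (a cycle) and chains deeper than MAX_DEPTH."""
    if task_id in call_stack:
        raise RuntimeError(f"delegation cycle: {task_id} already in {call_stack}")
    if len(call_stack) >= MAX_DEPTH:
        raise RuntimeError(f"delegation depth limit ({MAX_DEPTH}) exceeded")
    return call_stack + [task_id]

stack = delegate("task-a", [])
stack = delegate("task-b", stack)
# delegate("task-a", stack) would now raise: cycle detected
```

Passing the call stack along with every delegation message is the crucial part: the guard only works if the chain of ownership travels with the task.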

Context Loss During Handoffs

When a manager passes a task to a worker, critical context can be dropped. The worker then operates with incomplete information, producing outputs that don't fit the larger pipeline.

Common contexts that get lost:

  • Constraints established in earlier conversation turns
  • Intermediate outputs from sibling agents that inform this task
  • Style and format requirements from the original specification
  • Error conditions from previous failed attempts

Prevention:

  • Use structured context packages (like the task format described earlier) rather than free-text delegation
  • Implement a context continuity check: require workers to echo back their understanding of key constraints before proceeding
  • Store shared context in a persistent state object (as LangGraph does) rather than relying on message threading
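The echo-back check can be as simple as scanning the worker's restatement for each key constraint. This naive substring sketch is illustrative; production systems would use an LLM or embedding comparison:

```python
def continuity_check(constraints: list[str], worker_echo: str) -> list[str]:
    """Return the constraints the worker failed to acknowledge
    in its echoed-back understanding of the task."""
    echo = worker_echo.lower()
    return [c for c in constraints if c.lower() not in echo]

missing = continuity_check(
    ["PEP 8", "return (bool, str)"],
    "I will write a PEP 8-compliant email validator.",
)
print(missing)  # The return-type constraint was dropped; clarify before proceeding
```

A non-empty result means the handoff lost context, and the manager should re-send the dropped constraints before the worker starts.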

Blocking Chains and Parallelism Loss

When agents invoke downstream agents synchronously — waiting for each before proceeding — the pipeline becomes a sequential chain. This eliminates the parallelism benefits of multi-agent architecture and accumulates latency at every link.

# Anti-pattern: synchronous blocking chain
result_1 = await agent_1.run(task)
result_2 = await agent_2.run(result_1)  # Waits for agent_1
result_3 = await agent_3.run(result_2)  # Waits for agent_2

Any failure in the chain blocks all upstream agents. In long chains, this creates cascading failures that are difficult to recover from.

Prevention:

  • Model task dependencies explicitly and parallelize independent branches
  • Use event-driven architectures where agents publish results and downstream agents subscribe
  • Implement circuit breakers at each chain node to fail fast rather than block indefinitely

Quality Regression in Iterative Review

Each review round asks the worker to fix identified issues. But fixing one issue can introduce another — a well-documented phenomenon in software maintenance sometimes called fix churn. In multi-agent systems this manifests as oscillation: the worker fixes issue A in round 2, inadvertently re-introduces a variant of issue A in round 3, and so on.

Causes:

  • Worker agents optimize for the most recent reviewer feedback, not for the complete history of issues
  • Reviewer agents evaluate diffs rather than the full output, missing regressions outside the changed sections
  • Instruction conflicts between rounds are not resolved by the worker

Prevention:

  • Pass the full issue history to workers, not just the latest feedback
  • Require reviewers to check the full output each round, not just changes
  • Implement a regression detector: compare the current round's findings against all previous rounds, flag any issue that was resolved and has re-appeared
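A minimal regression detector along those lines, assuming each round's findings are normalized to a set of issue identifiers:

```python
def find_regressions(findings_by_round: list[set[str]]) -> set[str]:
    """Flag issues in the latest round that were seen in an earlier
    round, resolved (absent from the previous round), and are now back."""
    if len(findings_by_round) < 2:
        return set()
    latest, previous = findings_by_round[-1], findings_by_round[-2]
    regressions = set()
    for issue in latest:
        seen_before = any(issue in r for r in findings_by_round[:-1])
        if seen_before and issue not in previous:
            regressions.add(issue)
    return regressions

rounds = [
    {"sql-injection", "bad-naming"},   # round 1 findings
    {"bad-naming"},                    # round 2: sql-injection resolved
    {"sql-injection"},                 # round 3: it re-appeared
]
print(find_regressions(rounds))  # → {'sql-injection'}
```

A flagged regression is a strong signal to stop oscillating and hand the full issue history, not just the latest feedback, back to the worker.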

Role Confusion in Flat Topologies

In flat multi-agent setups where all agents are peers, role boundaries blur. Two agents may both attempt to act as planner, or neither may take responsibility for integration. Without clear hierarchical authority, decisions stall or conflict.

Prevention: Prefer explicit hierarchy over flat peer networks for complex tasks. Use flat topologies only when agents have non-overlapping, clearly delimited tool sets.

Runaway Cost Spirals

Iterative loops without hard limits on iterations and token budgets can consume enormous resources. A system tasked with iterating until convergence, where convergence is never reached due to an ill-specified quality criterion, will run indefinitely.

Prevention:

  • Set absolute limits on iterations (max_iterations) and token budgets at the framework level
  • Implement cost monitoring with circuit breakers that trigger automatic termination and human notification
  • Design quality criteria that are achievable — specify what "done" looks like before starting the loop
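A hard budget guard for the loop can be as simple as the sketch below. The `Budget` dataclass and `run_round` callback are illustrative, not a real framework API; in production the token-budget branch would also notify a human.

```python
# Sketch of a hard budget guard for an iterative review loop.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Budget:
    max_iterations: int = 4
    max_tokens: int = 200_000
    tokens_used: int = 0

def review_loop(budget: Budget,
                run_round: Callable[[int], tuple[list[str], int]]) -> str:
    for i in range(budget.max_iterations):
        findings, tokens = run_round(i)
        budget.tokens_used += tokens
        if budget.tokens_used > budget.max_tokens:
            return "terminated: token budget exceeded"  # trigger human notification
        if not findings:
            return f"converged after {i + 1} round(s)"
    return "terminated: max iterations reached"

# Simulated rounds: round 1 finds an issue, round 2 finds none.
rounds = [(["issue-A"], 12_000), ([], 9_000)]
print(review_loop(Budget(), lambda i: rounds[i]))  # → converged after 2 round(s)
```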

Metrics and Observability for Multi-Agent Systems

The Four Pillars of Agent Observability

Traditional observability (logs, metrics, traces) extends into the AI agent domain with additional AI-specific signals. The OpenTelemetry community's 2025 guidance on AI agent observability identifies four pillars:

1. Traces: Distributed traces that span agent boundaries, showing the full causal chain from user request to final output. Multi-agent tracing must capture agent invocations, tool calls, delegation events, and context handoffs as first-class spans. Tools like LangSmith, Langfuse, and Arize Phoenix provide framework-native agent tracing.

2. Metrics: Quantitative performance measurements:

  • Task completion rate (fraction of delegated tasks completed successfully)
  • Review convergence rate (fraction of iterative loops that converge before hitting max_iterations)
  • Mean rounds to convergence (average number of review iterations needed)
  • Agent utilization (fraction of time each agent is actively processing vs. idle/waiting)
  • Delegation depth distribution (how deep delegation chains go in practice)
  • Token cost per task (total tokens consumed across all agents for a task)

3. Logs: Structured event logs capturing agent decisions, delegation events, and quality assessments. The key requirement: logs must be agent-scoped, not just request-scoped, so it is possible to reconstruct what each agent did independently.

4. AI-specific signals:

  • Context relevance scores (how well the context passed to an agent matches the task)
  • Tool selection accuracy (did agents choose the right tools for each step)
  • Review finding rates (confirmed issues per review round, trending toward zero at convergence)
  • Hallucination indicators (factual claims that cannot be verified against sources)
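The agent-scoped logging requirement from pillar 3 boils down to every event carrying both a request (trace) ID and the emitting agent's ID, so each agent's activity can be reconstructed independently. A minimal stdlib sketch, with illustrative field names:

```python
# Agent-scoped structured logging: each event is a JSON record keyed by
# both trace ID (request scope) and agent ID (agent scope).
import json

def log_event(trace_id: str, agent_id: str, event: str, **fields) -> str:
    record = {"trace_id": trace_id, "agent": agent_id, "event": event, **fields}
    return json.dumps(record, sort_keys=True)

line = log_event("req-42", "reviewer-1", "review_finding",
                 round=2, issue="sql-injection", severity="critical")
print(line)
```

Filtering such a stream by `agent` reconstructs one agent's history; filtering by `trace_id` reconstructs one request's full causal chain.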

Delegation Effectiveness Metrics

Measuring whether delegation is actually working requires metrics that span the manager-worker relationship:

Rework rate: What fraction of delegated tasks are returned to workers for revision? A high rework rate (above 30-40%) suggests either the delegation specification was poor or the worker agent is under-capable for the assigned tasks.

First-attempt success rate: What fraction of delegated tasks are completed correctly on the first try, with no review feedback required? Trending this metric over time shows whether the system is learning to specify tasks more clearly.

Delegation hit rate: When the manager decides to delegate vs. handle itself, does delegation produce better outcomes? If not, delegation adds overhead without quality benefit.

Convergence waste: In review loops, what fraction of review rounds are "empty" (reviewer finds no new issues but loop hasn't terminated)? Empty rounds indicate the termination condition is too conservative.
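Three of these metrics can be computed directly from per-task records, as in the sketch below. The record fields (`review_rounds`, `empty_rounds`, `success`) are hypothetical names for data the observability layer would collect.

```python
# Computing delegation-effectiveness metrics from per-task records.
def delegation_metrics(tasks: list[dict]) -> dict:
    n = len(tasks)
    reworked = sum(1 for t in tasks if t["review_rounds"] > 1)
    first_try = sum(1 for t in tasks if t["review_rounds"] == 1 and t["success"])
    empty = sum(t["empty_rounds"] for t in tasks)
    total_rounds = sum(t["review_rounds"] for t in tasks)
    return {
        "rework_rate": reworked / n,
        "first_attempt_success_rate": first_try / n,
        "convergence_waste": empty / total_rounds,
    }

tasks = [
    {"review_rounds": 1, "empty_rounds": 0, "success": True},
    {"review_rounds": 3, "empty_rounds": 1, "success": True},
    {"review_rounds": 2, "empty_rounds": 0, "success": True},
    {"review_rounds": 1, "empty_rounds": 0, "success": False},
]
m = delegation_metrics(tasks)
print(m)  # rework_rate 0.5, first_attempt_success_rate 0.25, convergence_waste ~0.14
```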

Tooling Landscape

Current observability tooling for multi-agent systems:

  • LangSmith: Native LangChain/LangGraph tracing with agent-level spans, cost tracking, and evaluation runs
  • Langfuse: Framework-agnostic, open-source agent observability with scoring and dataset management
  • Arize Phoenix: Focused on LLM evaluation and tracing, strong support for multi-agent spans
  • AgentOps: Real-time monitoring with agent session replay
  • OpenTelemetry AI Semantic Conventions: Emerging standard for emitting agent telemetry in a framework-agnostic format; multiple frameworks now emit OTel-compliant spans

Microsoft's multi-agent reference architecture recommends OpenTelemetry as the interoperability layer: each framework emits OTel traces, which are collected and analyzed by a platform-agnostic backend.


Production Lessons and Case Studies

The State of Production Deployments in 2025

The first large-scale study of production AI agent systems (published late 2025) found that only 95 of 1,837 surveyed organizations had AI agents in production — approximately 5%. But the characteristics of those 95 deployments reveal what separates successful production systems from stalled pilots:

Successful deployments share three traits:

  1. Narrow task scope: Agents focused on a specific, well-defined workflow rather than general-purpose automation
  2. Human review checkpoints for high-stakes actions: No fully autonomous operation on irreversible consequences
  3. Investment in observability before scale: Logging, tracing, and cost monitoring in place before expanding agent scope

Failed or stalled deployments share three traits:

  1. Under-specified task scope: Agents given vague goals without clear success criteria
  2. Over-trust in agent outputs: Insufficient human validation of high-stakes decisions
  3. No observability infrastructure: Inability to debug failures or understand cost drivers

AtlantiCare Clinical Assistant

A production case study that illustrates effective hierarchical agent design: AtlantiCare deployed an agentic AI clinical assistant to 50 healthcare providers. The system used a coordinator agent that delegated to specialized agents for documentation drafting, coding suggestion, and compliance checking. A human-in-the-loop checkpoint required physician review before any note was finalized.

Results: 80% adoption rate among participating providers, 42% reduction in documentation time, approximately 66 minutes saved per provider per day. The checkpoint — rather than impeding adoption — was credited with building trust that enabled the high adoption rate.

Amazon's Evaluation Framework Lessons

AWS published a retrospective on building agentic systems internally that surfaced several production lessons:

  1. Evaluation must extend beyond accuracy: Agentic systems require measuring reasoning coherence (is the agent's chain of thought sound?), tool selection accuracy (is it choosing the right tools?), and task completion rate — not just output correctness.

  2. Multi-agent systems fail at boundaries: The most common failure points are agent handoffs, not individual agent behavior. Invest disproportionately in testing and observing the delegation layer.

  3. Cost surprises are the most common production incident: Token costs in iterative loops scale non-linearly with task complexity. Budget alerts and cost circuit breakers are not optional.

  4. Latency compounds: In hierarchical systems, each delegation hop adds network round-trip latency plus LLM inference time. A 5-hop delegation chain with 2-second average agent response times produces a 10-second minimum latency before any result returns.

Google Cloud's 2025 Retrospective

Google Cloud's Office of the CTO published a 2025 lessons report on enterprise agent deployments. Key findings:

  • The organizational bottleneck has shifted from model capability to trust: enterprises are not held back by what agents can do but by what they trust agents to do autonomously
  • Governance and observability infrastructure is the rate-limiting investment for scaling from pilot to production
  • The most valuable agent use cases in 2025 were not creative or generative tasks but operational tasks: data processing, document handling, routine decision support

Future Directions

Learned Delegation Policies

Current orchestrators use LLM judgment to route tasks — the supervisor reads the task description and decides which agent to invoke. This is flexible but expensive and inconsistent. Emerging research explores training lightweight routing models specifically for delegation: classifiers that can route tasks with high accuracy at far lower cost than invoking a frontier model for routing decisions.

Self-Organizing Agent Networks

The A2A protocol's capability registry mechanism — where agents advertise their capabilities and other agents discover them dynamically — enables emergent team formation. Rather than a human or orchestrator pre-assigning agents to roles, agents self-organize based on declared capabilities and current availability. This is analogous to a marketplace for agent labor: tasks are posted with requirements, and agents bid based on their capabilities.
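A toy version of such discovery, in the spirit of the A2A capability registry but not its actual API: agents advertise capability sets, and a task is matched to any available agent whose capabilities cover its requirements. All names are hypothetical.

```python
# Toy capability registry: match a task's required capabilities against
# advertised agent capabilities, skipping busy agents.
def match_agents(registry: dict[str, set[str]],
                 required: set[str],
                 busy: set[str]) -> list[str]:
    return sorted(
        name for name, caps in registry.items()
        if required <= caps and name not in busy
    )

registry = {
    "translator-1": {"translate", "summarize"},
    "coder-1": {"python", "review"},
    "coder-2": {"python", "review", "security"},
}
print(match_agents(registry, {"python", "review"}, busy={"coder-1"}))
# → ['coder-2']
```

A real marketplace would add bidding, pricing, and availability signals on top of this subset check.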

Formal Verification of Agent Workflows

As multi-agent systems take on higher-stakes decisions, informal testing is insufficient. Researchers are applying formal methods — model checking, theorem proving — to verify properties of agent coordination protocols. Questions like "can this system reach a state where no agent is authorized to approve its own request?" become formally provable rather than empirically tested.

Constitutional Constraints for Agent Hierarchies

Anthropic's Constitutional AI approach applies at the individual agent level, but multi-agent systems need constitutional constraints that span hierarchies: rules about what types of tasks can be delegated to what types of agents, what requires human approval, and what is prohibited entirely regardless of who requests it. Encoding these as formal policies that the coordination layer enforces (not just the individual agents) is an active area of development.

Adaptive Quality Convergence

Static iteration limits and quality thresholds are blunt instruments. Adaptive systems could learn, from past task execution data, which types of tasks converge quickly vs. slowly, and dynamically adjust iteration budgets and review stringency based on task characteristics. A routine code formatting task might need one review round; a complex security-sensitive refactor might need five, with a dedicated security reviewer injected in the final round.
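The idea can be sketched as a budget function keyed on task traits. The categories and thresholds below are illustrative placeholders; an adaptive system would learn them from past execution data rather than hard-code them.

```python
# Heuristic sketch of adaptive review budgets keyed on task traits.
def review_budget(task_type: str, security_sensitive: bool) -> dict:
    base = {"formatting": 1, "feature": 3, "refactor": 4}.get(task_type, 2)
    budget = {"max_rounds": base, "extra_reviewers": []}
    if security_sensitive:
        budget["max_rounds"] += 1
        # Inject a dedicated security reviewer in the final round.
        budget["extra_reviewers"].append("security")
    return budget

print(review_budget("formatting", False))  # → {'max_rounds': 1, 'extra_reviewers': []}
print(review_budget("refactor", True))     # → {'max_rounds': 5, 'extra_reviewers': ['security']}
```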


References