Actor Model and Communicating Agent Patterns for AI Multi-Agent Systems
Executive Summary
The AI industry's shift toward multi-agent systems in 2025-2026 is not a novel engineering challenge — it is a rediscovery of problems that distributed systems research solved decades ago. The Actor Model (1973), Communicating Sequential Processes (1978), and supervision tree patterns from Erlang/OTP (1986) provide battle-tested blueprints for the exact problems AI agent frameworks now face: how agents communicate without corrupting shared state, how to recover when an agent crashes mid-task, how to suspend execution while waiting for human input, and how to coordinate thousands of concurrent agents without deadlocks.
This article maps the classical concurrency foundations onto modern AI agent architectures, comparing how frameworks like LangGraph, CrewAI, AutoGen, and Anthropic's orchestrator-worker pattern implement — or fail to implement — these proven patterns. The goal is practical: if you are designing a multi-agent system where agents need to escalate to parents, communicate with peers, and interact with external systems, understanding the Actor Model is not optional background reading — it is the engineering foundation.
Key findings:
- Message-passing over shared state is now the dominant pattern in production multi-agent systems, matching the Actor Model's core principle from 50 years ago
- Supervision trees from Erlang/OTP map directly to orchestrator-worker hierarchies in AI agent frameworks, but most frameworks lack proper fault recovery strategies
- Durable execution (suspend/resume with checkpointed state) is the critical missing piece — LangGraph's interrupt/resume and Restate's journaled execution are the closest implementations
- Virtual actors (Microsoft Orleans) offer the most promising model for AI agents that need managed lifecycle, automatic activation/deactivation, and location-transparent communication
Classical Foundations
The Actor Model
The Actor Model, proposed by Carl Hewitt in 1973 and formalized by Gul Agha in the 1980s, defines computation as a collection of independent entities ("actors") that interact exclusively through asynchronous message passing. Each actor has three fundamental capabilities:
- Send messages to other actors it knows about
- Create new actors (spawn)
- Define behavior for the next message it receives (state transition)
The critical constraint: actors share nothing. No shared memory, no global state, no locks. All coordination happens through messages. This constraint, which seemed restrictive when competing with shared-memory threading models, turns out to be exactly what AI agent systems need.
   Actor A                     Actor B
┌──────────┐               ┌──────────┐
│  State   │    message    │  State   │
│ Behavior │ ────────────> │ Behavior │
│ Mailbox  │               │ Mailbox  │
└──────────┘               └──────────┘
      │                          │
      │ spawn                    │ send
      v                          v
┌──────────┐               ┌──────────┐
│ Actor C  │               │ Actor D  │
└──────────┘               └──────────┘
Key properties:
- Location transparency: An actor's address is opaque — the sender does not know (or care) whether the recipient is on the same machine, across the network, or has been migrated to a different node
- Sequential message processing: Each actor processes messages from its mailbox one at a time, eliminating concurrency bugs within a single actor without any locking
- No return values: Sending a message is fire-and-forget. If you need a response, the receiver sends a new message back. This decoupling is what enables true asynchrony
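These properties fit in a short sketch. The following is illustrative Python, not any framework's API; the `Actor` class, its `send` method, and the `None` poison pill are inventions of this example. Note that the mailbox serializes all message handling, so the behavior can mutate private state without locks.

```python
import asyncio

class Actor:
    """Minimal actor: private state, a mailbox, one message at a time."""
    def __init__(self, behavior):
        self._mailbox = asyncio.Queue()            # the actor's mailbox
        self._behavior = behavior                  # handler(actor, msg)
        self._task = asyncio.create_task(self._run())

    def send(self, msg):
        # Fire-and-forget: enqueue and return immediately, no response
        self._mailbox.put_nowait(msg)

    async def _run(self):
        while True:
            msg = await self._mailbox.get()        # sequential processing
            if msg is None:                        # poison pill: terminate
                return
            self._behavior(self, msg)

async def main():
    received = []
    # The behavior closes over private state; no locks are needed
    # because the mailbox serializes all access to it.
    actor = Actor(lambda self, msg: received.append(msg))
    for i in range(3):
        actor.send(i)
    actor.send(None)
    await actor._task
    return received

print(asyncio.run(main()))  # [0, 1, 2]
```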
Communicating Sequential Processes (CSP)
Tony Hoare's CSP model (1978) takes a different approach. Where actors are named entities with mailboxes, CSP uses anonymous processes that communicate through named channels.
| Property | Actor Model | CSP |
|---|---|---|
| Communication | Actor-to-actor (named recipient) | Through channels (named conduit) |
| Messaging | Asynchronous (non-blocking send) | Synchronous (sender blocks until receiver reads) |
| Identity | Actors have identity | Processes are anonymous |
| Buffering | Mailbox queues messages | Unbuffered rendezvous (sender and receiver synchronize) |
| Distribution | Natural fit for distributed systems | Primarily single-machine |
| Error model | "Let it crash" + supervision | Channel closure propagation |
Go's goroutines and channels are the most prominent CSP implementation today:
// CSP: anonymous processes communicating through a named channel
results := make(chan SearchResult)

// Spawn anonymous workers — they have no identity
go func() { results <- searchWeb(query) }()
go func() { results <- searchDatabase(query) }()
go func() { results <- searchCache(query) }()

// Collect results through the channel
for i := 0; i < 3; i++ {
    result := <-results // blocks until a result arrives
    process(result)
}
Why this matters for AI agents: CSP's synchronous channels provide natural backpressure — a fast producer cannot overwhelm a slow consumer. But the Actor Model's asynchronous mailboxes better match AI agent reality, where LLM calls take variable time and agents must remain responsive while waiting. Most AI agent frameworks have converged on actor-style async messaging, even when they do not explicitly reference the Actor Model.
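The backpressure point can be demonstrated directly in Python. asyncio has no true zero-capacity rendezvous channel, so this sketch uses a `Queue(maxsize=1)` as a stand-in for a CSP channel; the producer/consumer names and the `None` sentinel are invented for the example.

```python
import asyncio

async def producer(chan, n):
    for i in range(n):
        await chan.put(i)        # blocks when the channel is full: backpressure
    await chan.put(None)         # sentinel: no more items

async def consumer(chan):
    got = []
    while True:
        item = await chan.get()
        if item is None:
            return got
        await asyncio.sleep(0)   # pretend the consumer is slow
        got.append(item)

async def main():
    chan = asyncio.Queue(maxsize=1)   # bounded channel, CSP-style handoff
    _, got = await asyncio.gather(producer(chan, 5), consumer(chan))
    return got

print(asyncio.run(main()))  # [0, 1, 2, 3, 4]
```

The producer can never run more than one item ahead of the consumer; an unbounded actor mailbox would instead absorb the entire burst.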
Coroutines and Cooperative Scheduling
Coroutines — functions that can suspend and resume execution — provide the third concurrency primitive relevant to AI agents. Unlike threads (preemptive) or actors (message-driven), coroutines yield control cooperatively at explicit suspension points.
Python's async/await, Kotlin's coroutines, and JavaScript's generators all implement this pattern. For AI agents, coroutines map naturally to the "tool call" cycle:
async def agent_loop(task):
    while not task.complete:
        # Agent "thinks" — runs LLM inference
        action = await llm.generate(task.context)
        if action.type == "tool_call":
            # Suspend: agent yields while tool executes
            result = await execute_tool(action.tool, action.args)
            task.context.append(result)
        elif action.type == "escalate":
            # Suspend: agent yields, parent agent takes over
            result = await escalate_to_parent(action.reason, task)
            task.context.append(result)
        elif action.type == "delegate":
            # Spawn: create child agent for subtask
            child_result = await spawn_subagent(action.subtask)
            task.context.append(child_result)
The insight: an AI agent's execution is fundamentally a coroutine. It runs, hits an I/O boundary (LLM call, tool execution, human approval), suspends, and resumes when the result arrives. Frameworks that model agents as coroutines gain natural suspend/resume semantics.
Erlang/OTP: The Blueprint That AI Agents Need
Supervision Trees
Erlang's OTP framework introduced supervision trees — hierarchical structures where parent processes monitor and restart child processes on failure. This "let it crash" philosophy is the most important pattern that AI agent frameworks have yet to fully adopt.
                Application Supervisor
               /          |           \
     Agent Pool      Tool Executor     Memory Service
     Supervisor       Supervisor        Supervisor
     /   |   \           |    \              |
 Agent Agent Agent     Tool   Tool      State Store
   A     B     C       Exec   Exec
Supervision strategies in Erlang/OTP:
- one_for_one: If a child crashes, restart only that child. Use when siblings are independent.
- one_for_all: If any child crashes, restart all children. Use when siblings have interdependent state.
- rest_for_one: If a child crashes, restart it and all children started after it. Use for sequential dependencies.
Mapping to AI agents:
| Erlang Concept | AI Agent Equivalent |
|---|---|
| Supervisor | Orchestrator agent |
| Worker process | Subagent / tool executor |
| one_for_one restart | Retry failed subagent with same task |
| one_for_all restart | Restart entire agent group when shared context is corrupted |
| Escalation | Subagent reports failure to orchestrator for re-planning |
| Max restart intensity | Circuit breaker: stop retrying after N failures |
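The three restart strategies reduce to a small decision function. Here is a sketch with invented child names, assuming children are listed in start order:

```python
def children_to_restart(strategy, children, failed):
    """Which children a supervisor restarts, per Erlang/OTP semantics.

    `children` must be in start order; the names are invented for this sketch.
    """
    if strategy == "one_for_one":
        return [failed]                    # siblings are independent
    if strategy == "one_for_all":
        return list(children)              # siblings share interdependent state
    if strategy == "rest_for_one":
        idx = children.index(failed)       # sequential dependencies
        return children[idx:]              # failed child + later siblings
    raise ValueError(f"unknown strategy: {strategy}")

agents = ["planner", "researcher", "writer"]
print(children_to_restart("one_for_one", agents, "researcher"))   # ['researcher']
print(children_to_restart("rest_for_one", agents, "researcher"))  # ['researcher', 'writer']
```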
The Mailbox Pattern
Every Erlang process has a mailbox — an ordered queue of incoming messages. The process selectively pattern-matches against messages, processing them one at a time. Messages that do not match remain in the mailbox for later processing.
This selective receive pattern is powerful for AI agents:
% Erlang-style selective receive for an AI agent
agent_loop(State) ->
    receive
        {task, Task} ->
            NewState = process_task(Task, State),
            agent_loop(NewState);
        {escalation_response, ParentDecision} ->
            NewState = apply_parent_decision(ParentDecision, State),
            agent_loop(NewState);
        {peer_message, FromAgent, Content} ->
            NewState = handle_peer_communication(FromAgent, Content, State),
            agent_loop(NewState);
        {tool_result, ToolName, Result} ->
            NewState = incorporate_tool_result(ToolName, Result, State),
            agent_loop(NewState);
        shutdown ->
            cleanup(State),
            ok
    after 30000 ->
        % Timeout: no messages for 30s, agent can do housekeeping
        agent_loop(maybe_checkpoint(State))
    end.
The selective receive allows an agent to prioritize certain message types — for instance, always processing escalation responses before new tasks, or handling shutdown signals immediately regardless of queue depth.
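Python queues deliver strictly in FIFO order, so selective receive has to be emulated. One way is to scan the mailbox against an explicit priority list (Erlang instead matches receive clauses in order against the whole mailbox); the message kinds mirror the Erlang example above, but the priority ordering is invented for this sketch.

```python
from collections import deque

# Higher-priority kinds are handled first; this ordering is an assumption
# of the sketch, not something Erlang prescribes.
PRIORITY = ("shutdown", "escalation_response", "tool_result", "peer_message", "task")

def selective_receive(mailbox: deque):
    """Return the highest-priority queued message; leave the rest queued."""
    for kind in PRIORITY:
        for msg in mailbox:
            if msg[0] == kind:
                mailbox.remove(msg)
                return msg
    return None  # empty mailbox (Erlang would block in receive instead)

mbox = deque([("task", "summarize"), ("escalation_response", "approved")])
print(selective_receive(mbox))  # ('escalation_response', 'approved')
print(selective_receive(mbox))  # ('task', 'summarize')
```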
Process Lifecycle
Erlang processes have a well-defined lifecycle that maps cleanly to AI agents:
- Spawn: Process is created with initial state and behavior. (Agent is instantiated with a system prompt, tools, and initial context.)
- Running: Process consumes messages from its mailbox. (Agent processes tasks, makes LLM calls, uses tools.)
- Waiting: Process blocks on receive, consuming no CPU. (Agent is suspended, waiting for external input — human approval, tool result, peer response.)
- Linked/Monitored: Process is connected to its supervisor. (Agent reports status to its orchestrator; failures propagate up.)
- Terminated: Process exits normally or crashes, supervisor is notified. (Agent completes task or fails; orchestrator handles the outcome.)
Modern AI Agent Implementations
Anthropic's Orchestrator-Worker Pattern
Anthropic's multi-agent research system provides the clearest bridge between actor model theory and AI agent practice. The architecture uses a lead agent (orchestrator) that spawns 3-5 specialized subagents in parallel:
           User Query
               │
               v
  ┌──────────────────────┐
  │  Lead Agent (Opus)   │  ← Orchestrator / Supervisor
  │  - Analyzes query    │
  │  - Develops strategy │
  │  - Decomposes tasks  │
  └─────┬────┬────┬──────┘
        │    │    │   spawn (parallel)
        v    v    v
     ┌────┐┌────┐┌────┐
     │Sub ││Sub ││Sub │  ← Worker actors
     │ A  ││ B  ││ C  │
     └──┬─┘└──┬─┘└──┬─┘
        │     │     │   return findings
        v     v     v
  ┌──────────────────────┐
  │     Lead Agent       │  ← Synthesize, maybe spawn more
  │ - Evaluates results  │
  │ - Identifies gaps    │
  │ - Final synthesis    │
  └──────────────────────┘
Actor Model alignment:
- Each subagent has its own context window (isolated state — no shared memory)
- Communication is through structured messages (task descriptions down, findings up)
- The lead agent acts as a supervisor: if a subagent's results are insufficient, it can spawn replacement agents
- Subagents have no knowledge of each other (no peer-to-peer communication in this architecture)
What is missing relative to a full Actor Model implementation:
- No formal supervision strategy (no automatic restart on failure)
- Subagents currently execute synchronously from the lead's perspective — true async execution is planned but not yet implemented
- No mailbox pattern — subagents cannot receive messages after initial task assignment
Anthropic's Six Composable Patterns
Anthropic defines six building-block patterns for AI agents that map to concurrency primitives:
- Prompt Chaining — Sequential pipeline (CSP channel chain)
- Routing — Message dispatch by type (actor pattern matching / selective receive)
- Parallelization — Fan-out/fan-in (actor spawn + collect)
- Orchestrator-Worker — Dynamic supervision tree
- Evaluator-Optimizer — Feedback loop (cyclic message passing between two actors)
- Autonomous Agent — Self-directed actor with goal-seeking behavior loop
The orchestrator-worker pattern is the most actor-like: the orchestrator dynamically decides which tasks to create and which workers to activate, rather than following a fixed workflow graph.
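The parallelization pattern (fan-out/fan-in) maps directly onto `asyncio`. A sketch with invented subagent names, and sleeps standing in for LLM calls: `as_completed` lets the orchestrator process findings as each worker finishes, rather than waiting for all of them.

```python
import asyncio

async def subagent(name, delay):
    # Stand-in for an LLM-backed worker; the sleep models variable latency
    await asyncio.sleep(delay)
    return f"{name}: findings"

async def orchestrate():
    """Fan-out: spawn workers in parallel. Fan-in: collect as they finish."""
    tasks = [
        asyncio.create_task(subagent("web", 0.02)),
        asyncio.create_task(subagent("db", 0.01)),
        asyncio.create_task(subagent("cache", 0.0)),
    ]
    findings = []
    for done in asyncio.as_completed(tasks):  # completion order, not spawn order
        findings.append(await done)
    return findings

print(asyncio.run(orchestrate()))  # cache finishes first, web last
```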
LangGraph: State Machines with Checkpoints
LangGraph models agent workflows as directed graphs where nodes are processing steps and edges are transitions. Its key innovation for the actor model comparison is its checkpoint and interrupt system:
from langgraph.graph import StateGraph, START
from langgraph.checkpoint.postgres import AsyncPostgresSaver

# Define agent as a state machine
graph = StateGraph(AgentState)
graph.add_node("research", research_agent)
graph.add_node("review", human_review)
graph.add_node("synthesize", synthesis_agent)

# Add edges (transitions), including the entry point
graph.add_edge(START, "research")
graph.add_edge("research", "review")
graph.add_edge("review", "synthesize")

# Compile with checkpointing — enables suspend/resume
checkpointer = AsyncPostgresSaver(conn)
app = graph.compile(checkpointer=checkpointer, interrupt_before=["review"])

# Execute — graph pauses before "review" node
thread = {"configurable": {"thread_id": "research-42"}}
result = await app.ainvoke(initial_state, thread)
# Agent is now suspended, state persisted to Postgres

# ... hours later, human provides input ...
from langgraph.types import Command

result = await app.ainvoke(
    Command(resume="Approved with edits: ..."),
    thread,
)
# Agent resumes from checkpoint
Actor Model mapping:
- Nodes are actors with defined behavior
- Graph state is the shared mailbox (but writable by all nodes — this violates actor isolation)
- Checkpoints provide the persistence that Erlang processes get from OTP's state recovery
- Interrupts implement cooperative suspension (coroutine yield points)
- Thread IDs provide actor identity for resumed execution
The gap: LangGraph agents communicate through a shared mutable state object, not through messages. This is fundamentally a shared-memory model, not a message-passing model. It works well for linear workflows but becomes harder to reason about as the number of concurrent nodes grows.
OpenAI Agents SDK: Handoffs as Message Routing
The OpenAI Agents SDK (successor to Swarm) implements a handoff pattern that mirrors actor-style message routing:
from agents import Agent, handoff

# Define specialized agents
refund_agent = Agent(
    name="Refund Specialist",
    instructions="Handle refund requests...",
)
sales_agent = Agent(
    name="Sales Agent",
    instructions="Handle sales inquiries...",
)

# Triage agent routes to specialists via handoffs
triage_agent = Agent(
    name="Triage",
    instructions="Route customer to the right specialist.",
    handoffs=[
        handoff(refund_agent),
        handoff(sales_agent),
        handoff(
            agent=escalation_agent,
            on_handoff=lambda ctx: log_escalation(ctx),
        ),
    ],
)
Actor Model mapping:
- Agents are actors with named identity
- Handoffs are typed messages that transfer conversation context (like Erlang's process migration)
- The triage agent acts as a router/dispatcher — a common actor pattern
- Escalation handoffs with callbacks implement the supervisor notification pattern
What is novel: The handoff transfers the entire conversation context to the receiving agent, which is closer to process migration than message passing. In a pure actor model, you would send a message containing the relevant context, not transfer the entire process state.
AutoGen: Conversational Actor Groups
Microsoft's AutoGen (now merged with Semantic Kernel as of October 2025) models agents as participants in group conversations — an asynchronous message-passing system:
from autogen import AssistantAgent, GroupChat, GroupChatManager

# Agents as actors with distinct roles
researcher = AssistantAgent("researcher", llm_config=llm_config)
analyst = AssistantAgent("analyst", llm_config=llm_config)
writer = AssistantAgent("writer", llm_config=llm_config)

# Group chat as a message bus
group_chat = GroupChat(
    agents=[researcher, analyst, writer],
    messages=[],
    max_round=12,
)

# Manager as supervisor / orchestrator
manager = GroupChatManager(groupchat=group_chat, llm_config=llm_config)
AutoGen's group chat is the closest implementation to actor-style message passing in current AI frameworks. Each agent maintains its own state, receives messages asynchronously, and can address messages to specific peers or broadcast to the group. The GroupChatManager acts as a supervisor, deciding which agent speaks next.
CrewAI: Role-Based Actor Assignment
CrewAI takes a role-based approach where agents are defined by their expertise, tools, and delegation capabilities:
from crewai import Agent, Task, Crew

researcher = Agent(
    role="Senior Researcher",
    goal="Find comprehensive data on the topic",
    tools=[search_tool, scrape_tool],
    allow_delegation=True,   # Can spawn/delegate to other agents
)
writer = Agent(
    role="Technical Writer",
    goal="Produce clear, accurate content",
    tools=[write_tool],
    allow_delegation=False,  # Leaf worker — no spawning
)
crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, writing_task],
    process="sequential",    # or "hierarchical" with a manager agent
)
In hierarchical mode, CrewAI adds a manager agent that acts as a supervisor — decomposing tasks, delegating to specialists, and synthesizing results. The allow_delegation flag determines whether an agent can spawn subtasks to peers, creating a partial supervision tree.
Critical Patterns for Agent Communication
Pattern 1: Escalation (Child to Parent)
When a subagent encounters a situation beyond its capabilities, it must escalate to its supervisor. This is the actor model's supervisor notification, inverted:
Subagent encounters unknown situation
        │
        ├─ Option A: Fail fast ("let it crash")
        │    └─ Supervisor detects failure, decides restart strategy
        │
        ├─ Option B: Explicit escalation message
        │    └─ Subagent sends {escalate, Reason, Context} to supervisor
        │         └─ Supervisor re-plans and may reassign or handle directly
        │
        └─ Option C: Handoff with context transfer
             └─ Subagent transfers full state to a more capable agent
In Anthropic's system, subagents return their findings along with confidence signals, and the lead agent decides whether the results are sufficient or whether additional research is needed. This is implicit escalation through result quality — the lead agent evaluates rather than the subagent explicitly requesting help.
OpenAI's handoff pattern provides explicit escalation: an agent can hand off to an "Escalation Agent" with a structured reason, which gets logged for audit trails.
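Option B (explicit escalation) can be made concrete with a structured message type. A sketch with invented names: the `Escalation` dataclass and the `can_handle` capability check are stand-ins for whatever signal a real subagent would emit.

```python
from dataclasses import dataclass

@dataclass
class Escalation:
    """Explicit escalation message (Option B above); fields are invented."""
    reason: str
    context: dict

def run_subagent(task, can_handle):
    # A subagent either completes its task or escalates with a reason;
    # it never silently returns a low-quality result.
    if not can_handle(task):
        return Escalation(reason="missing capability", context={"task": task})
    return {"task": task, "status": "done"}

def supervisor(task):
    result = run_subagent(task, can_handle=lambda t: "refund" not in t)
    if isinstance(result, Escalation):
        # Re-plan: handle directly, or reassign to a more capable agent
        return {"handled_by": "supervisor", "reason": result.reason}
    return result

print(supervisor("summarize report"))  # subagent completes it
print(supervisor("process refund"))    # escalated to the supervisor
```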
Pattern 2: Peer Communication (Sibling to Sibling)
Most current AI agent frameworks do not support direct peer communication. Agents communicate through a shared state (LangGraph), through a group chat manager (AutoGen), or not at all (Anthropic's orchestrator-worker). This is a significant gap.
In actor systems, peer communication is fundamental — any actor can send a message to any other actor whose address it knows. For AI agents, peer communication enables:
- Information sharing: A research agent discovers context relevant to a writing agent's task
- Coordination: Multiple agents working on overlapping subtasks negotiate to avoid duplication
- Negotiation: Agents with conflicting conclusions engage in structured debate
┌──────────┐   "I found relevant data"   ┌──────────┐
│ Research │ ──────────────────────────> │ Writing  │
│  Agent   │                             │  Agent   │
└──────────┘ <────────────────────────── └──────────┘
              "Can you verify source X?"
AutoGen's group chat comes closest, allowing agents to address messages to specific peers within a shared conversation. But this is mediated by the GroupChatManager (a supervisor), not direct peer-to-peer messaging.
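Direct peer messaging needs only an address book mapping agent names to mailboxes. A minimal sketch (the `Registry` class is invented for this example, not any framework's API):

```python
import asyncio

class Registry:
    """Address book: agents look up peers by name and message them directly."""
    def __init__(self):
        self.mailboxes = {}

    def register(self, name):
        self.mailboxes[name] = asyncio.Queue()
        return self.mailboxes[name]

    def send(self, to, sender, content):
        # Direct peer-to-peer delivery: no manager mediates the exchange
        self.mailboxes[to].put_nowait((sender, content))

async def main():
    reg = Registry()
    research_inbox = reg.register("research")
    writing_inbox = reg.register("writing")
    reg.send("writing", "research", "I found relevant data")
    reg.send("research", "writing", "Can you verify source X?")
    return await writing_inbox.get(), await research_inbox.get()

print(asyncio.run(main()))
```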
Pattern 3: External I/O (Agent to World)
Agents interact with external systems through tool calls — API requests, database queries, file operations, browser automation. In actor model terms, these are messages to external actors (services) with asynchronous response patterns.
The critical challenge is blocking: an LLM call takes seconds to minutes, a human approval might take hours or days. During this wait, the agent should not consume compute resources.
Solutions from classical concurrency:
- Non-blocking I/O with futures (Akka pattern): Agent sends request, receives a future, continues processing other messages. When the future resolves, a result message arrives in the mailbox.
- Suspend/resume (coroutine pattern): Agent execution suspends at the I/O point and resumes when the result is available. LangGraph's interrupt/resume implements this.
- Durable execution (Restate pattern): The I/O call is journaled. If the agent crashes during the wait, it resumes from the journal without re-executing completed steps.
// Restate's durable execution: crash-safe tool calls
const toolResult = await restate_ctx.run("web_search", async () => {
  return await searchAPI.query(searchTerm);
});
// If agent crashes here and restarts, toolResult is replayed
// from the journal — the search is not re-executed

const llmResult = await restate_ctx.run("llm_call", async () => {
  return await llm.generate(prompt + toolResult);
});
// Same: LLM call result is persisted and replayed on recovery
Pattern 4: Lifecycle Management (Spawn/Suspend/Resume/Terminate)
A complete agent lifecycle model, synthesized from actor model patterns and current framework capabilities:
                    ┌─────────┐
          spawn     │         │    terminate
   ┌──────────────> │ Created │ ──────────────┐
   │                │         │               │
   │                └────┬────┘               │
   │                     │ initialize         │
   │                     v                    │
   │                ┌─────────┐               │
   │       ┌──────> │         │ ──────┐       │
   │       │        │ Running │       │       │
   │       │        │         │       │       │
   │       │        └────┬────┘       │       │
   │       │             │            │       │
   │    resume        suspend     escalate    │
   │       │             │            │       │
   │       │        ┌────v────┐       │       │
   │       │        │         │       │       v
   │       └─────── │Suspended│       │  ┌─────────┐
   │                │         │       │  │         │
   │                └─────────┘       └> │Completed│
   │                                     │         │
┌──┴───────┐                             └─────────┘
│Supervisor│
│ (Parent) │
└──────────┘
State transitions:
| Transition | Trigger | Actor Model Equivalent |
|---|---|---|
| Created -> Running | Supervisor spawns agent with task | spawn_link(Module, init, Args) |
| Running -> Suspended | Waiting for tool/human/LLM result | Process blocks on receive |
| Suspended -> Running | External result arrives | Message delivered to mailbox |
| Running -> Completed | Task finished successfully | Process exits normally |
| Running -> Completed (error) | Unrecoverable failure | Process crashes, supervisor notified |
| Completed -> Created | Supervisor restarts agent | Supervisor restart strategy |
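The table above is effectively a transition function, which can be checked mechanically. A sketch follows; matching the table, an error exit is folded into the Completed state rather than modeled as a separate Failed state.

```python
# Transition table from above; event names are invented shorthand.
TRANSITIONS = {
    ("created", "initialize"): "running",
    ("running", "suspend"): "suspended",     # waiting on tool/human/LLM
    ("suspended", "resume"): "running",      # external result arrived
    ("running", "complete"): "completed",    # normal exit
    ("running", "crash"): "completed",       # error exit, supervisor notified
    ("completed", "restart"): "created",     # supervisor restart strategy
}

def step(state, event):
    """Advance the lifecycle, rejecting transitions the table forbids."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state} --{event}-->") from None

state = "created"
for event in ("initialize", "suspend", "resume", "complete"):
    state = step(state, event)
print(state)  # completed
```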
Virtual Actors: The Most Promising Model for AI Agents
Microsoft Orleans introduced the virtual actor model, which addresses the lifecycle management problem that traditional actors leave to the developer. In Orleans, actors ("grains") are:
- Always conceptually alive: You never explicitly create or destroy a grain. You reference it by identity, and the runtime activates it on demand.
- Automatically deactivated: When a grain has been idle, the runtime removes it from memory. When a new message arrives, it is transparently reactivated.
- Location-transparent: The runtime decides which server hosts a grain. If a server fails, grains are automatically reactivated on surviving servers.
- Single-activation guaranteed: Only one instance of a grain (per identity) exists at any time, eliminating distributed state conflicts.
This model maps remarkably well to AI agents:
| Traditional Actor Model | Virtual Actor Model (Orleans) |
|---|---|
| Developer manages lifecycle | Runtime manages lifecycle |
| Explicit spawn/terminate | Reference by ID, auto-activate |
| Manual failover | Automatic failover |
| Memory consumed while idle | Deactivated when idle |
| Location must be known | Location-transparent |
For a system where thousands of AI agents need to exist — some active, some waiting for human input, some dormant until a scheduled trigger — virtual actors eliminate the operational burden of lifecycle management. An agent referenced by agent("research-task-42") would automatically activate when a message arrives, process it, and deactivate if idle, without any orchestration code.
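The activate-on-demand behavior can be sketched in a few lines. This is a toy model of the Orleans idea, not its API; a real runtime would also persist grain state on deactivation and reload it on reactivation, which this sketch omits.

```python
class VirtualActorRuntime:
    """Toy Orleans-style runtime (invented API, not Orleans itself)."""
    def __init__(self, idle_limit=300):
        self.active = {}       # identity -> in-memory grain
        self.last_used = {}    # identity -> timestamp of last message
        self.idle_limit = idle_limit
        self.activations = 0

    def send(self, identity, msg, now):
        # Never explicitly spawned: first reference activates the grain
        if identity not in self.active:
            self.activations += 1
            self.active[identity] = {"inbox": []}  # real runtime: load state
        self.last_used[identity] = now
        self.active[identity]["inbox"].append(msg)

    def sweep(self, now):
        # Deactivate grains that have been idle past the limit
        for ident, t in list(self.last_used.items()):
            if now - t > self.idle_limit:
                self.active.pop(ident, None)

rt = VirtualActorRuntime(idle_limit=300)
rt.send("research-task-42", "start", now=0)
rt.sweep(now=1000)                               # idle: removed from memory
rt.send("research-task-42", "resume", now=1001)  # transparently reactivated
print(rt.activations)  # 2
```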
Fault Tolerance Strategies
The "Let It Crash" Philosophy
Erlang's signature philosophy — do not write defensive code inside a process; instead, let it crash and have the supervisor handle recovery — is counterintuitive but powerful for AI agents.
Why it works for AI agents:
- LLM calls are inherently non-deterministic. A retry often produces a better result than defensive error handling.
- Tool calls can fail for transient reasons (rate limits, network issues). A clean restart with retry is more reliable than complex error recovery logic.
- Agent state can become corrupted (hallucinated context, contradictory information). Restarting with a clean state and the original task is often the best recovery.
Implementation pattern:
import time
import logging

log = logging.getLogger(__name__)

class AgentError(Exception): pass
class EscalationError(Exception): pass

class AgentSupervisor:
    def __init__(self, max_restarts=3, restart_window=60):
        self.max_restarts = max_restarts
        self.restart_window = restart_window
        self.restart_history = []

    async def supervise(self, agent_factory, task):
        while True:
            agent = agent_factory(task)
            try:
                result = await agent.run()
                return result  # Success
            except AgentError as e:
                self.restart_history.append(time.time())
                # Prune old restarts outside window
                cutoff = time.time() - self.restart_window
                self.restart_history = [
                    t for t in self.restart_history if t > cutoff
                ]
                if len(self.restart_history) > self.max_restarts:
                    # Too many restarts — escalate to parent
                    raise EscalationError(
                        f"Agent failed {self.max_restarts} times: {e}"
                    )
                # Otherwise, let it crash and restart
                log.warning(f"Agent crashed, restarting: {e}")
                continue
Durable Execution for Long-Running Agents
For agents that run for hours or days (research tasks, approval workflows, monitoring), crash recovery must not lose progress. Restate's durable execution pattern journals every side effect:
- Each tool call result is persisted before the agent continues
- On crash, the agent replays from the journal, skipping already-completed steps
- Pending external calls (human approvals, scheduled triggers) survive agent restarts
- The runtime guarantees exactly-once semantics for side effects
This is the distributed systems equivalent of a database transaction log, applied to agent execution.
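The journaled-replay mechanism itself is small. The following is a sketch of the idea (not Restate's actual implementation): each named step consults durable storage before executing, so a restarted agent skips work it already completed.

```python
class Journal:
    """Sketch of journaled execution; class and fields are invented."""
    def __init__(self, store):
        self.store = store        # stands in for durable storage
        self.executions = 0       # steps actually executed (not replayed)

    def run(self, name, fn):
        if name in self.store:    # step already journaled: replay its result
            return self.store[name]
        result = fn()             # first time: execute the side effect
        self.store[name] = result # persist before the agent continues
        self.executions += 1
        return result

store = {}
j1 = Journal(store)
j1.run("web_search", lambda: "10 results")

# Simulate a crash: a fresh Journal over the same durable store
j2 = Journal(store)
replayed = j2.run("web_search", lambda: "should not run")
print(replayed, j2.executions)  # 10 results 0
```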
Circuit Breakers and Backpressure
When an external service fails repeatedly, agents need circuit breaker patterns to avoid wasting tokens on doomed requests:
import time

class CircuitOpenError(Exception): pass

class ToolCircuitBreaker:
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Failures detected, reject calls
    HALF_OPEN = "half_open"  # Testing if service recovered

    def __init__(self, failure_threshold=5, reset_timeout=60):
        self.state = self.CLOSED
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.last_failure_time = None

    async def call(self, tool_fn, *args):
        if self.state == self.OPEN:
            if time.time() - self.last_failure_time > self.reset_timeout:
                self.state = self.HALF_OPEN
            else:
                raise CircuitOpenError("Tool unavailable, try later")
        try:
            result = await tool_fn(*args)
            # Any success closes the circuit and clears the failure streak
            self.state = self.CLOSED
            self.failure_count = 0
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = self.OPEN
            raise
Backpressure — the ability to slow down message producers when consumers are overwhelmed — is naturally handled by CSP's synchronous channels but requires explicit implementation in actor systems. For AI agents, this means: if the orchestrator spawns more subagents than the system can handle (LLM rate limits, memory constraints), there must be a mechanism to queue or reject new agent requests rather than degrading the entire system.
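For actor-style systems, the simplest such mechanism is a concurrency bound at the spawn point. A sketch using an `asyncio.Semaphore`; the names and the sleep standing in for an LLM call are invented.

```python
import asyncio

async def spawn_with_backpressure(subtasks, max_concurrent=2):
    """Cap concurrent subagents: the orchestrator waits at the spawn
    point once capacity is reached, instead of flooding the LLM API."""
    sem = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def subagent(task):
        nonlocal in_flight, peak
        async with sem:                # the backpressure point
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stands in for an LLM call
            in_flight -= 1
            return f"done:{task}"

    results = await asyncio.gather(*(subagent(t) for t in subtasks))
    return results, peak

results, peak = asyncio.run(spawn_with_backpressure(range(6)))
print(peak)  # 2 -- concurrency never exceeded the cap
```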
Practical Implications for Multi-Agent Architecture Design
Communication Topology Recommendations
Based on the analysis of classical patterns and current implementations:
For hierarchical tasks (research, analysis, content creation):
- Use the orchestrator-worker pattern (Anthropic's approach)
- Orchestrator acts as supervisor with one-for-one restart strategy
- Subagents are isolated workers with no peer communication
- Results flow up; tasks flow down
For collaborative tasks (debate, code review, planning):
- Use AutoGen-style group chat with a manager/supervisor
- Enable peer communication through the managed channel
- Manager handles turn-taking, conflict resolution, and termination
- Consider adding explicit escalation paths to a human supervisor
For long-running workflows (approval chains, monitoring, scheduled tasks):
- Use LangGraph-style state machines with checkpointing
- Implement suspend/resume at every external I/O boundary
- Persist state to durable storage (Postgres, not in-memory)
- Design for resume-after-days: the agent should reconstruct full context from the checkpoint
For high-scale agent pools (thousands of concurrent agents):
- Consider the virtual actor model (Orleans-style)
- Agents activate on demand, deactivate when idle
- Runtime handles placement, migration, and failover
- Identity-based addressing eliminates the need for service discovery
What Current Frameworks Get Wrong
- Shared mutable state instead of message passing. LangGraph's shared graph state is convenient but violates actor isolation. As the number of concurrent nodes grows, reasoning about state mutations becomes difficult.
- No formal supervision strategies. Most frameworks have ad-hoc error handling. None implement Erlang-style supervision trees with configurable restart strategies, max restart intensity, or escalation chains.
- Synchronous subagent execution. Anthropic's orchestrator-worker and many LangGraph implementations wait for all subagents to complete before proceeding. True actor systems allow the supervisor to process results as they arrive.
- Missing backpressure. No current framework limits the rate of agent spawning based on system capacity. An orchestrator can spawn 50 subagents hitting the same LLM API, causing cascading rate limit failures.
- Weak lifecycle management. Agents are either running or done. The suspended state (waiting for external input without consuming resources) is only partially implemented in LangGraph and Restate, and absent from CrewAI and AutoGen.
Designing a Communicating Agents Architecture
For a system where agents need escalation, peer communication, and external I/O — like a "shadow clone" architecture where the main agent spawns specialized workers — the recommended design combines patterns from multiple sources:
┌───────────────────────────────────────────────┐
│           Message Bus / Event Stream          │
│  (Kafka, Redis Streams, or in-process queue)  │
└─────┬───────────┬───────────┬──────────┬──────┘
      │           │           │          │
┌─────v────┐┌─────v────┐┌─────v────┐┌────v─────┐
│Supervisor││  Agent   ││  Agent   ││ External │
│  Agent   ││  Pool    ││  Pool    ││   I/O    │
│          ││(Research)││ (Action) ││  Bridge  │
└─────┬────┘└─────┬────┘└─────┬────┘└────┬─────┘
      │           │           │          │
      │     ┌─────v────┐      │     ┌────v─────┐
      │     │ Subagent │      │     │  Tools   │
      │     │ Subagent │      │     │  APIs    │
      │     │ Subagent │      │     │  Human   │
      │     └──────────┘      │     └──────────┘
      │                       │
      └───── Supervision ─────┘
           (restart, escalate,
                terminate)
Key design decisions:
- Message bus for all communication. No shared state. Agents send typed messages through a central bus. This enables logging, replay, and debugging of all agent interactions.
- Supervision hierarchy. Every agent has a supervisor. Supervisors implement configurable restart strategies. Escalation flows up the tree.
- Agent pools with backpressure. Worker agents are pooled. The pool supervisor limits concurrency based on available resources (LLM rate limits, memory, cost budget).
- External I/O bridge. All external interactions go through a dedicated bridge that handles retries, circuit breaking, and durable execution.
- Virtual actor lifecycle. Agents activate on message receipt and deactivate when idle. State is checkpointed to durable storage for resume-after-crash.
Conclusion
The AI industry is building multi-agent systems by trial and error, when 40 years of distributed systems research has already mapped the territory. The Actor Model provides the communication and isolation primitives. Erlang/OTP provides the supervision and fault tolerance patterns. CSP provides backpressure and synchronization. Durable execution frameworks provide crash recovery for long-running agents. Virtual actors provide lifecycle management at scale.
The frameworks that will win the multi-agent race are not the ones with the best LLM integration — they are the ones that correctly implement these proven concurrency patterns. The current generation (LangGraph, CrewAI, AutoGen, OpenAI Agents SDK) each implement fragments of the actor model. The next generation needs to implement it completely: isolated state, asynchronous message passing, supervision trees with configurable strategies, durable execution with checkpointing, and virtual actor lifecycle management.
For practitioners designing multi-agent systems today, the actionable advice is: study Erlang/OTP before studying LangChain. The patterns you need were invented in 1986.
Sources: Anthropic Engineering Blog, Akka Documentation, Erlang/OTP Design Principles, Microsoft Orleans Documentation, Restate Developer Documentation, LangGraph Documentation, OpenAI Agents SDK Documentation, Carl Hewitt's Actor Model papers, Tony Hoare's Communicating Sequential Processes

