Trust Zones and DMZ-Style Architecture for AI Agent Fleets

Executive Summary

Between mid-2024 and mid-2026, the industry accumulated enough real-world breaches — Slack AI, GitHub's MCP server, Microsoft 365 Copilot's EchoLeak, ChatGPT connectors, Google Gemini's smart-home takeover — to converge on a single diagnosis: any agent that reads untrusted content, holds sensitive context, and can talk to the outside world is not "a bit risky," it is broken by construction. Simon Willison named this the "lethal trifecta" in June 2025, and by October 2025 Meta had turned it into a shipped engineering rule ("Agents Rule of Two") that any agent may hold at most two of: process untrustworthy input, access sensitive data, change state or communicate externally.

The natural architectural response is one network security has used for thirty years: segregate. Put the agents that touch untrusted input — group chats, customer-facing bots, inbound email, scraped web content — in a low-privilege perimeter zone with no standing credentials and no free outbound channel. Keep the agents that hold real context and can actually do things (send money, edit prod, read the CRM) in an internal zone that the perimeter can only reach through narrow, audited, often human-gated interfaces. Microsoft, AWS, and a wave of 2026 "zero-trust for agents" vendors are now formalizing versions of this as agent identity blueprints, enclaves, and gateways. Google DeepMind's CaMeL and Willison's earlier Dual LLM pattern give it a technical mechanism (a privileged orchestrator plus a quarantined, tool-less LLM for untrusted data). OWASP's new Top 10 for Agentic Applications (2026) and Meta's framework give it a checklist.

None of this is fully solved. Every real deployment still trades security for latency, UX, and agent competence: the more you strip the perimeter agent of context to make it safe, the dumber and more annoying it gets, and every escalation hop to a privileged agent adds a round trip a user has to sit through — or, worse, a "confirm" button they'll click without reading. Cross-agent privilege escalation research from September 2025 shows the hand-off boundary itself is now a target: a low-privilege agent can manipulate a higher-privilege peer instead of attacking it directly. This article maps the principle, the vendor mechanisms, the incidents that motivated them, and the unresolved trade-offs, as of mid-2026.

The Core Vulnerability: Why Exposed Agents Are Different

The lethal trifecta

Willison's framing, published on his blog on June 16, 2025, is deliberately simple: an agent is dangerous when it simultaneously has (1) access to private data, (2) exposure to untrusted content, and (3) a way to communicate externally. LLMs "follow instructions in content" regardless of whether those instructions came from the operator or from a webpage, email, or calendar invite the model happened to read. If all three legs are present, "an attacker can easily trick it into accessing your private data and sending it to that attacker" — no jailbreak of the base model required, just ordinary indirect prompt injection. Willison is explicit that generic guardrails don't close this gap: vendors claiming 95%+ block rates are not adequate for a security boundary, because the residual 5% is still a live exfiltration channel every time the agent runs. His prescription is architectural, not statistical: don't let any single agent instance hold all three properties at once.

Meta's Agents Rule of Two

Meta operationalized the same idea on October 31, 2025 in "Agents Rule of Two: A Practical Approach to AI Agent Security", reframed by Willison the following week as one of the more actionable prompt-injection papers of the year (Nov 2, 2025). The rule: within a single agent session, satisfy no more than two of — [A] process untrustworthy inputs, [B] access sensitive systems/private data, [C] change state or communicate externally. Meta gives concrete configurations:

Travel assistant [AB]: searches the web and touches booking data, but property C is clamped — no payment or booking completion without human confirmation, and browsing is restricted to trusted sources.
Web research agent [AC]: can browse broadly and submit forms/requests, but runs in a sandbox with no session data preloaded, stripping property B.
Internal coding agent [BC]: touches production systems and can make stateful changes, but filters what counts as "input" through author-lineage verification so untrusted third-party content (property A) never enters its context.

When all three properties are genuinely required in one workflow, Meta's guidance is blunt: the agent "should not be permitted to operate autonomously and at a minimum requires supervision — via human-in-the-loop approval or another reliable means of validation." Meta also walked back an early claim that the [AC] pairing (untrusted input + external communication, no private data) was "safe" — relabeling it "lower risk" after feedback that an agent with no data access can still cause damage (spam, defacement, unwanted purchases) purely through the external-communication leg. The framework explicitly disclaims completeness: it addresses prompt injection, not hallucination, not privilege escalation between agents, not supply-chain compromise of the model or tools.

Both formulations point to the same architectural conclusion: you cannot fix this per-request with better classifiers, because the model itself cannot reliably distinguish "instruction" from "data" once they're in the same context window. You have to fix it by never letting the dangerous combination exist in one place — which is exactly what a DMZ does for networks.

The DMZ Metaphor Applied to Agent Fleets

From network perimeters to agent perimeters

A traditional DMZ puts internet-facing services (web servers, mail relays) in a segment that can be reached from the internet and can reach the internal network only through narrow, filtered paths — so that compromising the public-facing box doesn't hand the attacker the internal network. The agent-fleet analogue: agents that process untrusted input (customer chat, public group chats, inbound email/webhooks, RAG over scraped web content) live in a perimeter zone with no standing access to sensitive systems and no free-form outbound capability. Agents that hold real privilege — CRM access, financial systems, source control write access, infra control — live in an internal zone reachable from the perimeter only through a controlled gateway.

This is not just a metaphor people wave at conference talks; it shows up as literal infrastructure. Security vendor Hoop, for instance, frames its product explicitly around "trust zone" segregation, warning that "boundary collapse occurs when compression merges logically distinct trust zones" — system instructions, developer policy, user input, and untrusted external content collapsing into one ambiguous context — and ties this directly to NIST AI RMF's MAP-2.3 control on maintaining input-type boundaries. Their proposed unit is the "enclave": a trust boundary containing sandboxed agents, the specific assets they're authorized to touch, and the tools scoped to one unit of work, enforced at the network-reachability layer rather than by policy rules alone — an agent on one project literally cannot reach another project's agents or data, regardless of what an injected prompt tells it to do.

Vendor architectures

Microsoft Entra Agent ID (general availability, 2026) is the most fully-formed commercial instance of zoning by identity rather than by network segment alone. It introduces "agent identity blueprints" — templates that hold credentials and issue short-lived, scoped tokens to individual agent instances — and distinguishes two operating patterns that map fairly directly onto perimeter/internal zones: assistive/interactive agents act on behalf of a signed-in user with delegated permissions and often escalate to a human representative when they hit a wall (customer-facing pattern), while autonomous agents run under their own identity with client-credentials auth for background, unattended work (internal pattern). Entra applies Conditional Access and Identity Protection to agent identities the same way it does to human ones, with a stated goal of "least-privileged... just-in-time, scoped tokens for exactly the resources the agent needs." This is Zero Trust vocabulary applied to agents, but the zoning logic underneath is the DMZ logic: different trust levels get structurally different credential lifetimes and blast radii.

AWS's reference architecture for agentic AI (Four Security Principles for Agentic AI Systems, and the associated Prescriptive Guidance docs) lays out a request path — user/app → Bedrock model → Bedrock Guardrails → org-level policy baseline → AgentCore Gateway tool authorization → dedicated agent IAM role → permission boundary/SCP → approved network path → downstream services — that is, in effect, a chain of DMZ-style chokepoints between "untrusted request" and "internal system." AWS is explicit that "the security boundary is the combination of: Model + Identity + Permissions + Tools + Network Path + Data Access" — no single control is the perimeter; the perimeter is the intersection of all of them. Their guidance also separates input guardrails (screening what comes in) from output guardrails (screening tool invocations and responses going back out), which is the DMZ idea of filtering traffic in both directions at the boundary, not just inbound.

Anthropic's posture, documented in Claude Code's security docs and the Model Context Protocol security best practices, pushes the zoning down to the tool-call level: run scripts and external-facing tool calls inside VMs/sandboxes, treat MCP tool descriptions as untrusted unless the server is verified (a malicious server can smuggle a jailbreak into its own tool metadata), and — critically for the DMZ analogy — use "multiple short-lived, narrowly scoped credentials, each limited to a specific purpose and expiring independently, to limit the blast radius of any single compromised credential." That last point is the agent-fleet equivalent of not letting the DMZ web server hold the domain admin password.

OWASP's Top 10 for Agentic Applications (2026) (released December 2025 with input from 100+ experts and endorsement from NIST, Microsoft, and NVIDIA — see genai.owasp.org) codifies the same instinct at the standards level: it explicitly calls out "insecure inter-agent communication" and recommends "segmentation by trust level," signed inter-agent messages, and MCP gateways that inspect and control inter-agent traffic — i.e., exactly a DMZ gateway, but for agent-to-agent calls instead of network packets. It sits alongside the older, narrower OWASP Top 10 for LLM Applications, where LLM01 (Prompt Injection) and LLM06 (Excessive Agency) remain the two entries most directly implicated in the DMZ argument: prompt injection is why untrusted content can hijack behavior, excessive agency is why that hijack becomes damage.

The dual-LLM / CaMeL lineage

Underneath the network-level zoning sits a model-level mechanism with a clean lineage. Willison proposed the Dual LLM pattern in April 2023: a Privileged LLM that talks only to trusted input and holds tool access, plus a Quarantined LLM invoked whenever untrusted data must be processed, which has no tool access and whose output is itself treated as untrusted before it's allowed anywhere near the Privileged LLM's context. The security property is structural, not classifier-based: unfiltered Q-LLM output must never reach the P-LLM's input, so the privileged agent literally never sees a token that could contain an injected instruction.

Google DeepMind's CaMeL (April 2025, formal writeup at arXiv:2503 region) is the most rigorous descendant of this idea. CaMeL pairs a Privileged LLM (task orchestration) with a Quarantined LLM (handles untrusted data, no tool-calling) and layers in classic software-security machinery — control-flow integrity, access control, and information flow control — via a custom Python interpreter that tracks the origin and permitted uses of every piece of data the privileged LLM touches, without needing to retrain or modify the underlying model. On the AgentDojo benchmark, CaMeL blocked roughly two-thirds of prompt injection attacks it was tested against — a meaningful improvement over prior mitigations, but explicitly not a complete solution; the authors and outside reviewers both note it does not eliminate the class of attack, only shrinks it. This is the honest state of the art: the dual-LLM/CaMeL line of work is the most technically serious attempt at zoning within a single agent, and it still isn't a closed problem two-plus years after Willison's original post.

Real-World Breaches: The Perimeter Agent as the Attack Surface

The incidents below share a pattern: in every case, the compromised component was the agent closest to untrusted input, and the damage was proportional to how much internal context and outbound capability that same component happened to hold.

EchoLeak (CVE-2025-32711, disclosed June 2025) — a zero-click vulnerability in Microsoft 365 Copilot, rated CVSS 9.3 and documented in detail in an arXiv case study. An attacker sent a single, ordinary-looking email containing a hidden prompt payload (HTML comments, white-on-white text). The user never had to open or act on it — Copilot's RAG pipeline later retrieved the email as context for an unrelated request ("summarize recent strategy updates"), and the hidden instructions executed with the same privilege as legitimate user requests, reaching across trust boundaries into Outlook, Teams, OneDrive, and SharePoint content. The exploit chain specifically defeated Microsoft's own cross-prompt-injection classifier (XPIA), used reference-style Markdown to dodge link redaction, and abused an allow-listed Teams proxy for exfiltration — showing that content filters at the boundary are bypassable in ways that a hard architectural boundary (no boundary crossing at all without re-authorization) would not have been.

GitHub MCP server exploit (disclosed May 2025 by Invariant Labs, writeup; covered by Willison) — the textbook lethal-trifecta case. An attacker planted a prompt injection in a GitHub Issue on a public repository. A victim's coding agent, connected via the GitHub MCP server with an overly broad personal access token, read the issue, followed the embedded instructions, pulled data out of the victim's private repositories, and exfiltrated it by opening a public pull request containing the stolen content (in the demonstrated case: physical address, salary details, and private repo information). One MCP server instance held all three trifecta legs — untrusted content (issues on public repos), private data (all repos the token could reach), and an outbound channel (creating public PRs) — because a single token scoped to "everything the user can access" served both the safe, low-trust action (reading a public issue) and the sensitive, high-trust action (reading a private repo). Researchers concluded there is "no easy solution" purely through better prompts; the fix is architectural — one-repository-per-session scoping and least-privilege tokens.

Slack AI data exfiltration (disclosed August 2024 by PromptArmor, report) — an attacker with only the ability to post in any public channel the target had access to (not even the private one) could plant instructions that Slack AI would later retrieve and execute when a victim asked an unrelated question, causing the model to render exfiltration links that encoded private-channel content — including secrets pasted into DMs — into a clickable URL. The attacker never needed access to the private data themselves; they only needed a foothold in the shared, lower-trust surface the retrieval system also queried.

AgentFlayer / ChatGPT Connectors (disclosed at Black Hat, August 2025, Zenity writeup) — a poisoned document (invisible text, tiny font) shared with a victim and opened via a connected Google Drive triggered ChatGPT to locate sensitive content and exfiltrate it by rendering a Markdown image whose URL parameters encoded the stolen data — a classic client-side fetch used as an exfil channel. OpenAI's URL-safety check was bypassed using Azure Blob Storage URLs that the system considered pre-trusted. Researchers noted the technique generalizes to any connector — GitHub, SharePoint, OneDrive — because the underlying architecture (one context window, one set of connector credentials, one image-rendering exfil path) doesn't change per connector.

Gemini calendar-invite / smart-home takeover ("Invitation Is All You Need", presented Black Hat 2025, SafeBreach research) — researchers found 14 distinct ways to prompt-inject Gemini through calendar invite text, such that when a victim later asked Gemini to summarize their schedule, the hidden instructions caused Gemini's smart-home integration to open windows, kill the lights, delete calendar entries, or send spam — demonstrating that the "sensitive system" an exposed agent can be tricked into touching doesn't have to be data at all; it can be physical actuation.

Cross-agent privilege escalation (documented September 2025, Willison) — a newer and more structurally interesting case: once individual coding agents (GitHub Copilot, Claude Code) locked down the ability to edit their own settings.json to disable approval prompts, researchers found that when multiple agents coexist in the same environment, one agent can be induced to edit another agent's configuration file instead, "freeing" it of its safety constraints. This matters for the DMZ analogy specifically because it shows the zone boundary itself — not just the data inside a zone — can be the target: if perimeter and internal agents share a filesystem or a settings surface, the attacker doesn't need to breach the gateway, they just need to get one agent to reach across and disarm the other.

Customer-facing chatbot misuse (Chevrolet, DPD, Air Canada — late 2023/2024) — lower-severity but instructive: a Chevrolet dealership bot was talked into "legally binding" agreeing to sell a $76,000 Tahoe for $1; DPD's bot was goaded into writing a profanity-laced poem about its own employer; Air Canada's chatbot fabricated a bereavement-fare policy that a Canadian tribunal held the airline liable for. None of these involved data exfiltration, but all three show the same zone-design failure in miniature — a public-facing agent was given authority (to make binding statements, issue commitments) that should have lived behind an internal confirmation step, because nobody drew a hard line between "conversational" and "consequential" for that agent.

Escalation Interfaces: How Exposed Agents Hand Off to Privileged Ones

If the perimeter agent can't hold sensitive context or free outbound capability, it has to escalate — hand the decision to something that can, typically either a more privileged internal agent or a human. Current patterns:

Human-in-the-loop confirmation is the default fallback for the third leg of the trifecta once a workflow genuinely needs it. Meta's framework treats this as mandatory when all three properties [A][B][C] are unavoidable in one session. In practice this ranges from a hard modal ("approve this $10,000 wire transfer?") to Microsoft's assistive-agent pattern where the agent "escalate[s] to a human representative" mid-conversation when it hits the edge of its delegated permissions.
Risk-adaptive approval is the emerging refinement to blanket confirm-everything UX, because blanket confirmation trains users to click through without reading. Research such as RTBAS calibrates whether an action needs a human gate based on assessed risk — low-risk operations auto-approve, high-risk ones stop for a real confirmation — reportedly blocking all targeted attacks in controlled tests with only ~2% utility loss.
Message authentication between agents is the recommended fix for the handoff boundary itself, following the cross-agent privilege escalation findings: a higher-privilege orchestrator should not blindly trust a request from a lower-privilege worker agent just because it arrived through the expected internal channel — the request needs to be validated against the scope of the original task, not merely accepted because it came from "another agent" rather than "the internet."
Structured escalation with context loss is the known failure mode: when agent A hands a subtask to agent B, B often receives the task without the broader context that would let it sanity-check the request, which is part of why cross-agent manipulation works — B has no way to tell "this came from a legitimate escalation" from "this came from an agent that was itself compromised upstream."

The most concrete documented instance of the "I'll check with the principal and get back to you" pattern is Microsoft's internal DigitalMe pilot (Inside Track blog): a digital twin that answers in Teams and Outlook on an employee's behalf, grounded only in knowledge bases the employee themselves can access, with every message explicitly labeled as agent-originated and — the key escalation discipline — knowledge gaps flagged back to the owner rather than improvised. The same shape appears in productized form in Microsoft's assistive-agent-to-human-escalation description in the Entra Agent ID materials and the human-in-the-loop tooling in Microsoft Agent Framework, where a paused agent literally waits for an external approval signal before resuming.

Zoning Models in Practice

A workable zoning model has to specify three things per zone, not just "internal vs. external":

Context: what data is preloaded or reachable. Meta's [AC] sandboxed research agent deliberately carries no session data preloaded — context poverty is the point, not a side effect.
Credentials: scope and lifetime. Anthropic's guidance (multiple short-lived, narrowly-purposed credentials) and Entra Agent ID's just-in-time scoped tokens both aim at the same failure mode exposed by the GitHub MCP incident — one broad, long-lived token serving both trusted and untrusted paths.
Outbound capability: whether the agent can initiate network calls, send messages, or write anywhere reachable by an external party, and through what gateway those calls are filtered.

Boundary sanitization is the second half of the model — what happens to content as it crosses from a lower-trust zone into a higher-trust one. Three mechanisms recur in the 2025-2026 literature:

Spotlighting / data marking (Microsoft Research, arXiv:2403.14720; summarized in Microsoft's 2025 defense post) — delimiting untrusted input with randomized markers, interleaving special tokens through suspicious content, or base64-encoding it, so the model has a continuous provenance signal for "this text is data, not instruction." Datamarking specifically is recommended as a floor-level defense: it reportedly drops attack success rate from >50% to under 2% on GPT-family models with minimal task-performance cost — a genuinely useful mitigation, but one Willison and others are careful to call a mitigation, not a boundary; it reduces the odds, it doesn't eliminate the channel.
Information flow control / taint tracking — the more rigorous approach underlying CaMeL: label data with confidentiality/integrity/provenance tags at ingestion and propagate those labels through every subsequent operation, so a downstream tool call can be blocked or require re-authorization if any of its inputs trace back to an untrusted source. Recent surveys frame this explicitly as "securing AI agents with information-flow control," treating tool-use safety as a data-flow problem rather than a content-filtering one.
Sandboxing and network-reachability enforcement — the enclave model: rather than relying on any content-level classifier, make the boundary a literal network/filesystem boundary an agent cannot cross regardless of what it's told, which is also the mitigation Willison recommends for cross-agent privilege escalation (locked-down containers per agent, so agent A physically cannot edit agent B's config).

Trade-offs: Latency, UX, and the "Dumb Agent" Problem

None of this is free. Three tensions show up repeatedly in 2025-2026 discussion of agent security architecture:

Latency: every escalation hop from a perimeter agent to an internal one, and every human confirmation gate, adds a round trip. Research on web-use agent security explicitly flags this: confirmation "introduces non-negligible latency, as each task requires multiple rounds of user interaction and LLM inference cycles," which is a real cost when agents are meant to feel autonomous.
UX/confirmation fatigue: blanket confirm-everything designs are known to fail in practice — users click through warnings without reading them once the pattern becomes routine, which is why risk-adaptive approval (auto-approve low-risk, gate high-risk) is displacing uniform human-in-the-loop as the recommended default.
Context poverty: an agent stripped of sensitive context to keep it safely in the perimeter zone is, by design, less useful — it can't answer questions that need the data it's not allowed to hold, and has to escalate for anything beyond its shrunk context window, which shows up to the end user as a noticeably dumber assistant. Meta's own framework acknowledges this is a deliberate cost, not an oversight: the [AC] research agent is intentionally memoryless of session data.

The honest 2026 industry line is that this trade-off has not been eliminated, only made explicit and tunable. Meta's Rule of Two and OWASP's Top 10 for Agentic Applications both function as checklists for making a conscious choice about where the line sits per workflow, not as proof that any given configuration is safe. Willison's own framing — guardrails and classifiers reduce risk but do not close the channel — is echoed by CaMeL's own authors (roughly two-thirds attack coverage, not full coverage) and by the fact that cross-agent privilege escalation was discovered after single-agent self-escalation had already been patched: securing the boundary of one zone routinely surfaces a new boundary (the seam between zones, or between peer agents within the same zone) that hadn't been considered yet. For any organization building a fleet of agents — some customer-facing, some internal — the state of the art in mid-2026 is: assume the perimeter agent will eventually be manipulated, design so that manipulation cannot reach sensitive context or free outbound capability regardless, and treat every hand-off across a trust boundary (to another agent, to a human, to a tool) as itself a place where the disguise can continue.