Zylos
2026-05-07

AI Agent Multi-Tenant Architecture: Isolation, Resource Governance, and Shared Infrastructure

Tags: ai-agents, multi-tenant, isolation, resource-governance, platform-architecture

Executive Summary

Multi-tenancy for AI agent platforms is categorically harder than traditional SaaS: agents execute arbitrary code, hold mutable in-memory state, invoke external tools with real-world side effects, and make LLM calls that can leak context across request boundaries. The isolation surface spans data, compute, credentials, and the model inference layer itself. Production platforms in 2025-2026 converge on a layered defense approach — namespace or container isolation for the runtime, vector-namespace or separate-database isolation for memory, credential vaulting with per-tenant scopes, and token-bucket rate limiting per tenant. The shared-vs-dedicated spectrum has no universal answer; the right position depends on threat model, latency sensitivity, and cost constraints.

The Multi-Tenancy Challenge for Agent Platforms

Traditional SaaS multi-tenancy is largely a data problem: ensure tenant A cannot read tenant B's rows. For AI agent platforms, the problem extends into four additional dimensions.

Stateful, long-running execution. Unlike a REST handler that finishes in milliseconds, an agent run can last minutes or hours, holding open LLM sessions, tool connections, and in-process memory. A pod recycled from one tenant to the next can silently bleed workspace files, cached secrets, or in-memory context if cleanup is not rigorous.

Ambient authority through tools. An agent is granted tool access — file system, web browser, code execution, external APIs — that dramatically amplifies the blast radius of a cross-tenant breach. A single namespace escape in a code-execution sandbox can reach the host and every other tenant running on the same node.

Model inference as a side channel. KV-cache sharing, a standard optimization in LLM serving infrastructure, creates a measurable timing side channel: time-to-first-token is detectably faster on a cache hit. An adversary controlling an agent can exploit this to reconstruct another tenant's cached prompt token by token. This is not theoretical — it appeared as a documented attack class in 2025.

Prompt and context contamination. When tenant context is passed into a shared model call without strict scoping, prompt injection can cause an agent to exfiltrate data it should not see, or to act on instructions from a different tenant's knowledge base.

These properties mean that multi-tenant agent platforms must think like operating systems — enforcing isolation at the process, memory, file system, network, and capability levels — not just like databases.

Isolation Layers

Data and Memory Isolation

Agent memory exists at multiple levels: ephemeral working memory within a run, conversation history, semantic long-term memory (vector stores), and structured state (relational or document databases).

Ephemeral workspace isolation is the first line. Each agent run should receive its own ephemeral workspace directory, destroyed in a finally block on completion. Background sweeps must prune stale directories, but sweeps must themselves be tenant-scoped to avoid one tenant's cleanup job reading another tenant's workspace. Kubernetes-based platforms address this via the Agent Sandbox project (kubernetes-sigs/agent-sandbox), which provisions isolated stateful pods per run and resets pod state between tenants.
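A minimal Python sketch of the per-run workspace pattern, assuming a local filesystem and a hypothetical layout of one subdirectory per tenant under `base_dir`. The names `ephemeral_workspace` and `stale_workspaces` are illustrative and do not come from any specific platform.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


def _tenant_root(base_dir: str, tenant_id: str) -> Path:
    # Assumption: tenant IDs are plain alphanumeric strings, so they are
    # safe to use as a single path component.
    if not tenant_id.isalnum():
        raise ValueError("tenant_id must be alphanumeric")
    return Path(base_dir) / tenant_id


@contextmanager
def ephemeral_workspace(base_dir: str, tenant_id: str, run_id: str):
    """Yield a private directory for one agent run and always remove it."""
    root = _tenant_root(base_dir, tenant_id)
    root.mkdir(parents=True, exist_ok=True)
    path = Path(tempfile.mkdtemp(prefix=f"{run_id}-", dir=root))
    try:
        yield path
    finally:
        # Runs on normal completion and when the run raises.
        shutil.rmtree(path, ignore_errors=True)


def stale_workspaces(base_dir: str, tenant_id: str) -> list[Path]:
    """Tenant-scoped sweep: only ever lists this tenant's own directory."""
    root = _tenant_root(base_dir, tenant_id)
    return sorted(root.iterdir()) if root.is_dir() else []
```

Keeping each tenant's workspaces under its own subdirectory means the sweep never has to enumerate, or even open, another tenant's paths.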

Vector store isolation requires deliberate architecture. Mixing embeddings from multiple tenants in a single index is dangerous: without a strict per-query tenant_id filter, a similarity search can return a match from a different tenant's corpus. The two production patterns are:

  • Namespace isolation: Most vector databases (Pinecone, Qdrant, Weaviate) support namespaces or collections. Assign one namespace per tenant; stamp every vector with tenant_id; enforce the filter on every read path, including agent tool calls.
  • Separate indices: Higher isolation, higher cost. Appropriate for regulated tenants (healthcare, finance) where data commingling carries legal risk.
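The namespace pattern can be illustrated with a small in-memory sketch. It is not the API of Pinecone, Qdrant, or Weaviate; `TenantScopedIndex` is a hypothetical wrapper whose only purpose is to show the tenant filter being applied before any similarity scoring happens.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    tenant_id: str
    doc_id: str
    vector: tuple


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TenantScopedIndex:
    """Toy index: every vector is stamped with tenant_id at write time."""

    def __init__(self):
        self._records = []

    def upsert(self, tenant_id: str, doc_id: str, vector) -> None:
        self._records.append(Record(tenant_id, doc_id, tuple(vector)))

    def search(self, tenant_id: str, query, top_k: int = 3) -> list:
        if not tenant_id:
            raise ValueError("tenant_id is required on every read path")
        # Filter first, rank second: other tenants' vectors are never scored.
        candidates = [r for r in self._records if r.tenant_id == tenant_id]
        ranked = sorted(candidates, key=lambda r: cosine(r.vector, query), reverse=True)
        return [r.doc_id for r in ranked[:top_k]]
```

Making `tenant_id` a required positional argument of `search`, rather than an optional filter, is one way to keep agent tool calls from accidentally issuing an unscoped query.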

OWASP LLM Top 10 v2025 introduced LLM08:2025 specifically to address multi-tenant vector and embedding weaknesses, noting that authorization enforced at the vector retrieval layer can be bypassed when vector retrieval seeds a knowledge graph traversal — a subtler failure mode that requires authorization checks at both layers.

Relational/document store isolation follows the classic spectrum: shared schema with tenant_id predicates (cheapest, hardest to audit), shared database with separate schemas (better isolation), or separate database instances per tenant (maximum isolation, highest ops cost). For agent platforms, the operational overhead of separate databases is often justified for enterprise tenants because it eliminates the risk of a misconfigured query leaking data.

User-scoped persistent memory is an emerging pattern. Microsoft Azure AI Foundry (April 2026) introduced user-scoped persistent memory where each user's memory store is cryptographically scoped to their identity — agents access only the memory belonging to the authenticated principal. MCP servers are converging on a similar pattern: tool calls that read or write memory require a user-scoped credential, and the server enforces per-user isolation at the persistence layer.

Compute and Runtime Isolation

The compute isolation options span five levels, from no isolation to full VMs, each with a different security/performance tradeoff:

| Layer | Technology | Isolation strength | Cold start overhead |
|---|---|---|---|
| None (shared process) | Thread-per-request | None | ~0ms |
| Namespace | Linux namespaces + cgroups | Process-level | ~50ms |
| User-space kernel | gVisor (runsc) | Syscall interception | ~200ms |
| MicroVM | Firecracker / KVM | Hardware-level kernel | ~100-150ms |
| Full VM | Traditional VMs | Full hardware | >1s |

For multi-tenant agent platforms, the practical choice is between gVisor and Firecracker, with bubblewrap as a weaker fallback:

  • gVisor (Google's user-space kernel) intercepts all syscalls before they reach the host kernel, drastically reducing the attack surface. It is the default isolation layer in the Kubernetes Agent Sandbox standard and in GKE's agent sandbox offering. The dispatch order in OSS sandbox implementations typically tries: Firecracker → gVisor → bubblewrap → namespace → none.
  • Firecracker (the microVM monitor that underpins AWS Lambda) launches each run inside its own Linux kernel on KVM, providing hardware-level separation with sub-150ms cold starts. This is appropriate when the threat model includes kernel exploits, a concern for platforms that execute LLM-generated code from untrusted tenants.
  • Bubblewrap (bwrap), which uses Linux user namespaces, is sufficient for single-developer scenarios or deployments where all parties share roughly the same trust level (e.g., protecting against prompt injection in a controlled environment), but not for adversarial multi-tenant scenarios where attackers control agent inputs.
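The dispatch order mentioned above can be sketched as a simple fallback chain. This is an assumption-laden illustration: it only checks whether each tool's usual binary (`firecracker`, `runsc`, `bwrap`, `unshare`) is on the PATH, which says nothing about whether KVM or user namespaces are actually usable on the host. `select_backend` is a hypothetical helper name.

```python
import shutil

# Strongest isolation first, matching the order described in the text.
# Each entry is (backend name, binary conventionally used to launch it).
BACKENDS = [
    ("firecracker", "firecracker"),
    ("gvisor", "runsc"),
    ("bubblewrap", "bwrap"),
    ("namespace", "unshare"),
]


def select_backend(is_available=shutil.which, allow_unsandboxed: bool = False) -> str:
    """Return the strongest backend whose binary is available."""
    for name, binary in BACKENDS:
        if is_available(binary):
            return name
    if allow_unsandboxed:
        return "none"
    # Failing closed is the safer default for a multi-tenant platform.
    raise RuntimeError("no sandbox backend available")
```

The `is_available` parameter is injectable so the selection logic can be tested without the binaries installed.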

Production platforms run a warm pool of pre-initialized sandboxed pods to absorb cold start latency. The Kubernetes Agent Sandbox Warm Pool Orchestrator maintains a pool of pre-warmed gVisor or Kata Containers pods and assigns them to runs on demand, achieving sub-second effective cold starts. The critical operational requirement: warm pool recycling between tenants must flush all filesystem state, shared memory, and environment variables — a step that is often missed in early implementations, causing tenant data bleed.

Model Access and Credential Scoping

Model access isolation has two distinct problems: (1) which models a tenant can use, and (2) ensuring LLM inference for one tenant cannot leak into another tenant's context.

Per-tenant model entitlements are straightforward: maintain a tenant configuration record that specifies allowed model IDs, maximum context window, and any fine-tuned model variants. Gate every inference call through an entitlement check before dispatching to the model API.
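As a rough sketch of such a gate, assuming a hypothetical `TenantEntitlements` record with only two of the fields mentioned above (allowed model IDs and maximum context window):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantEntitlements:
    allowed_models: frozenset
    max_context_tokens: int


def check_entitlement(ent: TenantEntitlements, model_id: str, prompt_tokens: int) -> None:
    """Raise PermissionError unless the call is within the tenant's entitlements."""
    if model_id not in ent.allowed_models:
        raise PermissionError(f"model {model_id!r} is not enabled for this tenant")
    if prompt_tokens > ent.max_context_tokens:
        raise PermissionError("prompt exceeds the tenant's context window limit")
```

The check is meant to run before the request is dispatched to the model API, so a denied call never consumes quota.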

Inference-level isolation is subtler. The KV-cache side channel (described above) means that platforms sharing an LLM inference cluster across tenants must either disable cross-tenant KV-cache sharing or accept the timing side channel. High-security deployments deploy dedicated inference endpoints per tenant — or per tenant tier (e.g., enterprise tenants get dedicated inference; free-tier tenants share, with the side channel risk disclosed).

Credential scoping for downstream tool access follows the principle of least privilege applied at tenant granularity. Each tenant should have:

  • A dedicated identity principal (service account, IAM role, or OIDC subject) with no cross-tenant trust relationships
  • Tool credentials (API keys, OAuth tokens) that are tenant-specific and stored in a vault with per-tenant access policies
  • No "super-role" that spans tenants — even the platform admin credentials should be scoped by operation, not by tenant bypass

The MCP specification (formalized November 2025) classifies MCP servers as OAuth Resource Servers and supports Resource Indicators (RFC 8707), enabling agents to explicitly declare the intended recipient of each access token. This is the emerging standard for credential scoping in agent tool calls: the agent states which resource it is accessing, the authorization server mints a scoped token, and the MCP server validates scope on every tool call.

Post-2025 security research consistently identifies broad persistent agent credentials as the primary attack vector in supply chain breaches — agents deployed with "super-user" access to company systems that could be hijacked by prompt injection or compromised tool servers. The remediation pattern is: discrete principal per agent identity, minimal permissions, complete audit trail.

Tool and Integration Isolation

Tools are the most consequential isolation domain because they have real-world side effects. A file-write tool can overwrite shared infrastructure. A code-execution tool can run arbitrary binaries. A web-browser tool can exfiltrate data to attacker-controlled endpoints.

Tool whitelisting at the tenant level: each tenant's agent configuration specifies a permitted tool set. The platform enforces this at the agent runtime layer — tools not in the whitelist are not exposed to the agent, regardless of what the system prompt requests.
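A minimal sketch of runtime-level enforcement, assuming tools are plain Python callables held in a hypothetical registry dictionary:

```python
from typing import Callable


def expose_tools(registry: dict, allowed: set) -> dict:
    """Return only the tools this tenant may see; the rest are never exposed."""
    return {name: fn for name, fn in registry.items() if name in allowed}


def call_tool(exposed: dict, name: str, *args, **kwargs):
    """Dispatch a tool call, refusing anything outside the exposed set."""
    tool: Callable = exposed.get(name)
    if tool is None:
        # The model asked for a tool the tenant is not entitled to.
        raise PermissionError(f"tool {name!r} is not permitted for this tenant")
    return tool(*args, **kwargs)
```

Because the filter is applied when the tool set is built, a system prompt or injected instruction cannot name a tool back into existence.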

Network egress control: agent containers should run on isolated Docker networks with egress routed through an allowlist proxy. An agent that can make arbitrary outbound HTTP calls is a data exfiltration vector. The proxy enforces per-tenant domain allowlists; by default, only the tool's declared endpoints are reachable.
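The allowlist decision a proxy makes can be sketched as follows. This covers only the hostname check; a real egress proxy would also need to handle DNS resolution, IP literals, and redirects. `egress_allowed` is a hypothetical helper.

```python
from urllib.parse import urlsplit


def egress_allowed(url: str, allowed_domains: set) -> bool:
    """Allow HTTPS requests to an allowlisted domain or one of its subdomains."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    host = (parts.hostname or "").lower().rstrip(".")
    # Match on a full label boundary so "evilslack.com" and
    # "slack.com.evil.example" do not match "slack.com".
    return any(host == d or host.endswith("." + d) for d in allowed_domains)
```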

File system scoping: filesystem access is scoped to the agent's ephemeral workspace directory. Bind mounts should not include host paths or cross-tenant persistent volumes. Symlink traversal attacks must be prevented by resolving all paths to canonical form before access checks.
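A short sketch of the canonicalize-then-check step, using only `pathlib`. It assumes Python 3.9+ for `Path.is_relative_to`. Note that resolving a path and later opening it are separate steps, so a production implementation would also need to guard against the path changing in between.

```python
from pathlib import Path


def resolve_in_workspace(workspace: str, user_path: str) -> Path:
    """Resolve user_path against the workspace and reject anything outside it."""
    root = Path(workspace).resolve()
    # resolve() collapses ".." segments and follows symlinks, so the check
    # below sees the real target rather than the path as written.
    candidate = (root / user_path).resolve()
    if not candidate.is_relative_to(root):
        raise PermissionError(f"{user_path!r} escapes the workspace")
    return candidate
```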

Integration credential injection: tools that call external APIs (Slack, Salesforce, GitHub) must receive per-tenant credentials at injection time, not from a shared configuration. The pattern is: platform vault holds (tenant_id, integration_type) → credential; the tool executor fetches and injects at runtime; credentials are never stored in the agent's prompt or workspace.
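The lookup-and-inject flow might look like the sketch below. `InMemoryVault` is a stand-in for a real secrets manager and exists only to show the (tenant_id, integration_type) key; it is not how any particular vault product is called.

```python
class InMemoryVault:
    """Hypothetical stand-in for a vault with per-tenant access policies."""

    def __init__(self):
        self._secrets = {}

    def put(self, tenant_id: str, integration_type: str, secret: str) -> None:
        self._secrets[(tenant_id, integration_type)] = secret

    def get(self, tenant_id: str, integration_type: str) -> str:
        try:
            return self._secrets[(tenant_id, integration_type)]
        except KeyError:
            # No fallback to a shared or default credential.
            raise PermissionError("no credential for this tenant and integration")


def run_integration_tool(vault, tenant_id, integration_type, tool, payload):
    """Fetch the tenant's credential at call time and pass it only to the tool."""
    secret = vault.get(tenant_id, integration_type)
    return tool(payload, secret)
```

The credential exists only as an argument to the tool executor, so it never enters the prompt, the agent's workspace, or the model's context.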

Resource Governance

Token Quotas and Cost Allocation

Token quotas are the primary lever for cost governance in agent platforms. The production pattern uses a layered quota hierarchy:

  • Platform quota: the total token budget available from the model provider (e.g., AWS Bedrock TPM per account, Azure OpenAI tokens per subscription)
  • Tenant quota: a per-tenant allocation carved from the platform quota, configured in a tenant record
  • User/agent quota: optional sub-allocation within a tenant, useful for enterprise customers who want to budget by department or use case

AWS Bedrock's token burndown model illustrates the complexity: for Anthropic Claude 3.7+, output tokens consume quota at 5x the rate of input tokens. A naive per-request token count will dramatically underestimate actual quota consumption; quota systems must use the model-specific burndown multiplier.
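The effect of the multiplier is easiest to see with a worked example. The function below is a sketch of the arithmetic only, using the 5x output rate stated above as the default; the request sizes are made up for illustration.

```python
def quota_consumed(input_tokens: int, output_tokens: int, output_burndown: int = 5) -> int:
    """Quota units consumed when output tokens burn down faster than input tokens."""
    return input_tokens + output_burndown * output_tokens


# A request with 1,000 input tokens and 500 output tokens:
naive = 1000 + 500                    # 1,500 tokens by a plain count
actual = quota_consumed(1000, 500)    # 1,000 + 5 * 500 = 3,500 quota units
```

In this example the plain count captures well under half of the quota actually consumed, and the gap widens as responses get longer relative to prompts.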

Azure's agentgateway (AKS, April 2026) provides token rate limiting buckets with CEL expressions, enabling per-application token budgets enforced at the gateway layer before requests reach the model API. This is the right architectural position: enforce quotas at the infrastructure layer, not in application code, to prevent circumvention.

Cost allocation requires usage telemetry per tenant per model per time window. Each inference call should emit a structured event containing tenant_id, model_id, input_tokens, output_tokens, and timestamp. A streaming aggregation pipeline (Kafka + Flink, or simpler batch jobs) produces per-tenant cost dashboards and enables chargeback to enterprise customers.
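A simple batch version of that aggregation is sketched below. The event fields follow the list above; the price table (per 1,000 tokens, keyed by model) and the one-hour window are assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageEvent:
    tenant_id: str
    model_id: str
    input_tokens: int
    output_tokens: int
    timestamp: float  # seconds since epoch


def aggregate_cost(events, price_per_1k: dict, window_seconds: int = 3600) -> dict:
    """Sum cost per (tenant_id, model_id, window index)."""
    totals = defaultdict(float)
    for e in events:
        window = int(e.timestamp // window_seconds)
        in_price, out_price = price_per_1k[e.model_id]
        cost = (e.input_tokens * in_price + e.output_tokens * out_price) / 1000
        totals[(e.tenant_id, e.model_id, window)] += cost
    return dict(totals)
```

A streaming pipeline would compute the same grouping incrementally; the key structure is what makes per-tenant chargeback possible.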

Rate Limiting Strategies

Agent platforms face a rate limiting problem that is harder than standard API rate limiting because a single agent "task" may generate dozens of LLM calls, tool calls, and sub-agent invocations — the logical unit is the task, but the quota unit is the token or request.

Token bucket algorithm per tenant is the standard approach: a bucket with capacity C (burst limit) refills at rate R (sustained limit). Tokens are consumed on each LLM call; if the bucket does not hold enough tokens, the call is queued or rejected. The token bucket handles bursty agent workloads well: a complex task that requires many sequential LLM calls can burst up to the bucket capacity, while sustained high-volume tenants are throttled to the refill rate.
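A minimal single-process sketch of a per-tenant token bucket follows. The clock is passed in explicitly so the behaviour is deterministic; a production limiter would typically live in a gateway or a shared store rather than in application memory.

```python
class TokenBucket:
    """Bucket with capacity C that refills at R tokens per second."""

    def __init__(self, capacity: float, refill_rate: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity  # start full so an initial burst is allowed
        self.updated = now

    def try_consume(self, amount: float, now: float) -> bool:
        # Refill for the elapsed time, capped at the burst capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.updated = max(self.updated, now)
        if amount <= self.tokens:
            self.tokens -= amount
            return True
        return False  # caller decides whether to queue or reject


buckets = {"tenant-a": TokenBucket(capacity=100, refill_rate=10)}
```

With capacity 100 and a refill rate of 10 per second, a tenant can spend 100 tokens at once but only 10 per second on a sustained basis.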

Multi-level rate limiting for platforms serving both free and enterprise tenants:

  • Free tier: strict TPM and RPM limits, no burst headroom
  • Pro tier: higher sustained rate, moderate burst
  • Enterprise tier: negotiated quotas, dedicated capacity option

The multi-agent quota sharing problem (documented in detail in 2026): when multiple sub-agents within the same tenant run concurrently, they compete for the tenant's shared quota. Without coordination, agents can individually stay within their rate limit while collectively exhausting the quota. The solution is a shared ledger per tenant: each agent registers a soft reservation at task start; the ledger tracks total reserved + consumed capacity; idle agents donate unused capacity back to the pool. Azure's agentgateway and open-source implementations like Zuplo's token-based rate limiting provide reference architectures for this pattern.
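The shared-ledger idea can be sketched in a few lines. This is a single-process, non-thread-safe illustration with hypothetical names; it is not taken from agentgateway or Zuplo, and a real ledger would need atomic updates in a shared store.

```python
class TenantQuotaLedger:
    """Tracks reserved plus consumed tokens against one tenant's quota."""

    def __init__(self, quota: int):
        self.quota = quota
        self.consumed = 0
        self.reserved = {}  # agent_id -> tokens currently held

    def available(self) -> int:
        return self.quota - self.consumed - sum(self.reserved.values())

    def reserve(self, agent_id: str, tokens: int) -> bool:
        """Soft reservation at task start; refused if the pool cannot cover it."""
        if tokens > self.available():
            return False
        self.reserved[agent_id] = self.reserved.get(agent_id, 0) + tokens
        return True

    def consume(self, agent_id: str, tokens: int) -> None:
        held = self.reserved.get(agent_id, 0)
        if tokens > held:
            raise ValueError("consumption exceeds the agent's reservation")
        self.reserved[agent_id] = held - tokens
        self.consumed += tokens

    def release(self, agent_id: str) -> int:
        """Donate unused reserved capacity back to the pool."""
        return self.reserved.pop(agent_id, 0)
```

Because every agent reserves against the same ledger, the sum of reservations can never exceed the tenant quota, which is the property that independent per-agent limits fail to provide.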

Fair Scheduling Across Tenants

Fair scheduling prevents a high-volume tenant from starving others on shared infrastructure. The relevant algorithms:

Weighted fair queuing (WFQ): each tenant has a weight proportional to its tier or purchased capacity; the scheduler selects the next request from the tenant with the highest "debt" (difference between allocated share and actual consumption). This is the standard algorithm in network schedulers, adapted here to LLM inference slots.
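The "highest debt" selection rule can be sketched as below. This is a simplified share-deficit scheduler that follows the description above; it is not textbook WFQ, which orders packets by virtual finish time. All names are hypothetical.

```python
def pick_next_tenant(weights: dict, served: dict, backlog: dict):
    """Pick the backlogged tenant furthest below its weighted share."""
    candidates = [t for t in weights if backlog.get(t, 0) > 0]
    if not candidates:
        return None
    total_weight = sum(weights.values())
    total_served = sum(served.get(t, 0) for t in weights)

    def debt(tenant):
        # Proportional to (weight share - served share); cross-multiplied
        # so the comparison stays in exact integer arithmetic.
        return weights[tenant] * total_served - served.get(tenant, 0) * total_weight

    # Ties go to the heavier tenant.
    return max(candidates, key=lambda t: (debt(t), weights[t]))
```

With weights of 3 and 1 and both tenants continuously backlogged, repeated selection converges on a 3:1 split of inference slots.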

Dominant Resource Fairness (DRF): for agent platforms where requests have heterogeneous resource profiles (some requests are compute-heavy due to long contexts; others are I/O-heavy due to many tool calls), DRF equalizes each tenant's dominant share, meaning its largest fractional share of any single resource. The resulting allocation is envy-free: no tenant would prefer another's allocation. Extended DRF work from Berkeley (EECS 2025-97) provides generalized formulations for heterogeneous workloads.
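A compact sketch of DRF by progressive filling: repeatedly give one more task to the tenant with the lowest dominant share until nothing else fits. The resource names and numbers are illustrative (they follow the well-known two-user example from the original DRF paper), and `drf_allocate` is a hypothetical helper rather than the extended formulation cited above.

```python
def drf_allocate(capacity: dict, demands: dict) -> dict:
    """Return tasks granted per tenant under Dominant Resource Fairness."""
    for tenant, need in demands.items():
        if not any(need[r] > 0 for r in capacity):
            raise ValueError(f"{tenant} must demand at least one resource")
    used = {r: 0 for r in capacity}
    tasks = {t: 0 for t in demands}
    active = set(demands)

    def dominant_share(tenant):
        # Largest fraction of any single resource the tenant currently holds.
        return max(tasks[tenant] * demands[tenant][r] / capacity[r] for r in capacity)

    while active:
        tenant = min(sorted(active), key=dominant_share)
        need = demands[tenant]
        if all(used[r] + need[r] <= capacity[r] for r in capacity):
            for r in capacity:
                used[r] += need[r]
            tasks[tenant] += 1
        else:
            active.discard(tenant)  # this tenant's next task no longer fits
    return tasks


allocation = drf_allocate(
    {"cpu": 9, "mem": 18},
    {"A": {"cpu": 1, "mem": 4}, "B": {"cpu": 3, "mem": 1}},
)
```

Here tenant A is memory-dominant and tenant B is CPU-dominant; both end with a dominant share of two thirds.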

Priority lanes: agent platforms typically define 2-3 priority classes (interactive user-facing agents vs. background batch agents). Within each tenant, interactive requests are served before batch regardless of overall tenant priority. This prevents a tenant's batch jobs from degrading the latency of their user-facing agents.

The Shared-Dedicated Spectrum

Agent infrastructure exists on a spectrum from fully shared to fully dedicated. The decision involves trading off cost, isolation strength, customization, and operational complexity.

Fully shared runtime (one process pool serves all tenants): cheapest, highest efficiency, hardest to isolate. Acceptable only when tenants are internal teams within the same organization with symmetric trust, or when the attack surface is strictly controlled (e.g., agents can only access read-only APIs).

Shared runtime with sandbox isolation (shared pool, per-run gVisor or Firecracker container): the dominant production pattern for public multi-tenant platforms. Each agent run is isolated at the container or MicroVM level; the pool is shared for scheduling and warm pool efficiency. This is the pattern used by AWS Bedrock AgentCore, GKE Agent Sandbox, and most cloud-native agent platforms. The tradeoff: isolation is strong for compute, but model inference and vector stores may still be shared infrastructure with logical (not physical) isolation.

Dedicated sandbox pool per tenant: each tenant has its own warm pool of pre-initialized containers. Higher cold start efficiency within the tenant (warm pods are never shared), no cross-tenant pod recycling risk, but pool utilization drops for low-volume tenants. Appropriate for enterprise tier customers with predictable load.

Dedicated inference endpoint: the tenant has an exclusive LLM inference endpoint (dedicated GPU allocation or dedicated serverless slot). Eliminates KV-cache side channels and guarantees latency SLAs regardless of other tenants' load. Cost-prohibitive for most tenants; reserved for top-tier enterprise or regulated industry deployments.

Fully dedicated deployment (private VPC or on-premises): the agent platform itself is deployed exclusively for one tenant, with its own nodes, model deployments, and storage. CrewAI Enterprise's AOP Suite and LangGraph's private cloud offering support this model. The tenant bears the full infrastructure cost but achieves maximum customization and compliance guarantees. This is the pattern used by financial institutions and healthcare organizations with strict data residency requirements.

The practical recommendation is to offer tiered isolation levels as a pricing dimension: shared sandbox (free/entry tier), dedicated sandbox pool (pro tier), dedicated inference (enterprise tier), fully dedicated (regulated/strategic tier). Customers self-select based on their compliance requirements and budget.

Production Patterns

AWS Multi-Tenant Agentic AI Prescriptive Guidance (published 2025, updated 2026) provides the most complete public reference architecture for multi-tenant agent platforms on cloud infrastructure. Key patterns: tenant context propagation through all layers via a signed JWT claim, per-tenant IAM role assumption for tool execution, separate Bedrock Knowledge Bases per tenant with query-time tenant isolation enforced at the retrieval layer, and CloudWatch dashboards keyed by tenant_id for cost allocation.

Blaxel's multi-tenant agent platform documents a three-zone architecture: the control plane (tenant management, auth, quota enforcement) is fully shared; the agent orchestration layer uses namespace isolation with gVisor sandboxes; the data plane (vector stores, tool integrations) is logically isolated by tenant namespace with an option to upgrade to physically isolated databases for enterprise tenants.

LangGraph 1.0 (GA October 2025) stabilized durable execution as a first-class primitive — agents that survive server restarts and resume from exact checkpoints. This makes multi-tenant state management more tractable because checkpoints are keyed by thread_id, and thread ownership can be enforced per tenant. However, LangGraph itself does not enforce multi-tenant isolation — that is the platform's responsibility. LangGraph Cloud's hosted offering implements namespace isolation at the deployment level; self-hosted deployments must implement tenant isolation in the application layer.

MCP server security is a new frontier in 2025-2026 as MCP becomes the standard tool protocol. The November 2025 MCP authorization specification introduces per-resource token scoping. Platforms should enforce: (1) MCP server registration is tenant-scoped (tenants cannot use each other's MCP servers); (2) access tokens for MCP tool calls are minted with explicit resource and tenant scope; (3) MCP server implementations validate both the resource scope and the tenant claim on every tool call.

Agent-level RBAC is evolving beyond simple role assignments. A 2026 analysis from tianpan.co argues that traditional RBAC is insufficient for agents because agents have transitive authority — granting an agent access to a tool effectively grants it access to everything that tool can reach. The emerging model is capability-based access control for agents: each agent is granted a set of explicit capabilities (read-file-in-workspace, call-slack-api-for-tenant-X), capabilities are non-transferable, and the runtime enforces the capability set regardless of what tools the agent discovers through MCP.

Architectural Recommendations

For platform builders evaluating multi-tenant agent architecture, a decision framework by concern:

Start with the threat model. Are tenants mutually untrusting organizations (SaaS), or internal teams (internal tooling)? For internal tooling, namespace isolation + shared inference is usually sufficient. For SaaS, the minimum bar is gVisor/Firecracker + separate vector namespaces + per-tenant credential vaults.

Choose isolation at the right layer for each resource type:

  • Compute: gVisor sandboxes for cost-efficient isolation; Firecracker for maximum security
  • Vector memory: separate namespaces minimum; separate indices for regulated tenants
  • Relational state: shared schema acceptable for low-risk data; separate schemas or databases for sensitive data
  • Model inference: shared inference with cross-tenant KV-cache sharing disabled as the security baseline; dedicated endpoints for latency SLA guarantees or to remove the side channel entirely
  • Tool credentials: always per-tenant, always vault-managed, never in prompts or environment variables

Implement quota enforcement at infrastructure, not application layer. Application-layer quota checks can be bypassed by bugs or prompt injection. Rate limiting and token quotas belong in an API gateway or service mesh that the application cannot circumvent.

Propagate tenant context as a signed claim through all layers. Every service call — LLM inference, vector retrieval, tool execution, state read/write — should carry the tenant identity as a signed, validated claim. This enables both security enforcement and cost attribution at every layer.
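A stdlib-only sketch of a signed tenant claim follows. It uses an HMAC over a JSON payload as a stand-in for a real JWT; a production system would more likely use a JWT library with asymmetric keys so downstream services can verify without being able to sign. The token layout and function names are assumptions.

```python
import base64
import hashlib
import hmac
import json


def sign_claim(claims: dict, key: bytes) -> str:
    """Serialize claims and append an HMAC-SHA256 signature."""
    payload = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return f"{payload.decode()}.{signature}"


def verify_claim(token: str, key: bytes, now: float) -> dict:
    """Return the claims if the signature is valid and the token is unexpired."""
    payload_b64, _, signature = token.rpartition(".")
    expected = hmac.new(key, payload_b64.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison; verify before parsing the payload.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        raise PermissionError("invalid signature")
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    if claims["exp"] <= now:
        raise PermissionError("claim expired")
    return claims
```

Each service would call `verify_claim` on entry and use the returned `tenant_id` for both authorization and usage attribution, never a tenant identifier supplied elsewhere in the request.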

Design for warm pool recycling safety. The most common security regression in agent platforms is incomplete cleanup during pod recycling. Implement explicit, audited cleanup procedures: destroy workspace directories, flush environment variables, rotate ephemeral credentials, and run a verification scan before returning a pod to the pool.

Audit the vector retrieval authorization gap. When agents use vector retrieval to seed knowledge graph traversal or to generate context for LLM calls, authorization enforced at the vector layer does not automatically extend to downstream layers. Implement authorization checks at each layer independently (defense in depth), not only at the entry point.

Expose isolation tier as a first-class configuration dimension. Tenants should be able to choose and pay for their isolation level. This avoids the architectural anti-pattern of building to the highest isolation requirement for all tenants, which makes the platform unaffordable for lower-risk use cases.

References