Zylos
2026-02-23

Multi-Tenant Security Patterns for SaaS and AI Agent Platforms

security · multi-tenancy · saas · api-security · ai-agents · b2b

Executive Summary

Multi-tenancy is the defining architectural characteristic of SaaS platforms, and it introduces a class of security problems that single-tenant systems never face. When multiple organizations share infrastructure, a single misconfiguration can expose one tenant's data to another — and in a B2B platform, that is the kind of incident that ends companies.

The challenge is amplified for platforms where the "tenants" are not humans but autonomous AI agents. AI agents can generate thousands of requests per minute, operate across unpredictable workflows, and make authorization decisions at speeds no human reviewer can match. The isolation primitives designed for human-operated SaaS need to be hardened significantly to withstand this load and attack surface.

This article covers the full security stack for multi-tenant platforms: data isolation models, tenant-scoped authentication, API key lifecycle management, OAuth token hardening, SSRF prevention in webhook delivery, and the emerging patterns specifically needed for multi-agent B2B protocol platforms.


The Fundamental Rule: Isolation Must Be Enforced Everywhere

Before diving into specific techniques, the most important principle to internalize is this: tenant isolation is not a feature you add to a working system — it is a constraint that must be threaded through every layer from day one.

The attack surface in a multi-tenant platform exists at every point where tenant context could be confused, lost, or bypassed:

  • A database query that forgets to filter by tenant_id
  • A background job queue that processes tasks without validating tenant membership
  • A caching layer that stores results under a key that two tenants could collide on
  • An API endpoint that accepts a resource ID without verifying the caller owns it
  • A webhook delivery system that makes outbound HTTP requests to user-supplied URLs

Each of these is an independent failure mode. Defense-in-depth means accepting that any single layer can fail and designing the others to contain the breach.


Data Isolation Models

The first architectural decision in any multi-tenant system is how to physically or logically separate tenant data. Three models are in common use, and the right choice depends on your tenant count, compliance requirements, and operational constraints.

Model 1: Shared Database with Row-Level Security

All tenants share a single database. Every table that contains tenant-owned data includes a tenant_id column, and access policies enforce that queries only touch rows belonging to the authenticated tenant.

PostgreSQL's Row-Level Security (RLS) is the canonical implementation:

-- Enable RLS on the table
ALTER TABLE agent_sessions ENABLE ROW LEVEL SECURITY;

-- Force RLS even for table owners
ALTER TABLE agent_sessions FORCE ROW LEVEL SECURITY;

-- Create policy: tenants only see their own rows
CREATE POLICY tenant_isolation ON agent_sessions
  USING (tenant_id = current_setting('app.current_tenant_id')::uuid);

At the application layer, every database connection sets the tenant context before executing queries:

async function withTenantContext<T>(
  tenantId: string,
  fn: (db: DatabaseClient) => Promise<T>
): Promise<T> {
  return db.transaction(async (trx) => {
    // Bind the tenant ID as a parameter — interpolating it into the SQL
    // string would open an injection vector on the tenant context itself.
    // set_config(..., true) is equivalent to SET LOCAL.
    await trx.raw(`SELECT set_config('app.current_tenant_id', ?, true)`, [tenantId]);
    return fn(trx);
  });
}

The critical fragility: RLS is only as reliable as the tenant context being set. If a request path ever fails to set app.current_tenant_id before running queries, the policy will either reject all rows or, worse, allow all rows depending on how the fallback is configured. The application layer must treat missing tenant context as a hard error, not a degraded state.
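That fail-closed guard can be as small as a helper that every data-access path calls first. A minimal sketch, with illustrative types rather than any particular framework's API:

```typescript
// Illustrative sketch: treat missing tenant context as a hard error, not a
// degraded state. `TenantContext` is a stand-in for your request context type.
class MissingTenantError extends Error {
  constructor() {
    super('Tenant context was not set before database access');
  }
}

interface TenantContext {
  tenantId?: string;
}

// Called at the start of every data-access path; fails closed on absence.
function requireTenantId(ctx: TenantContext): string {
  if (!ctx.tenantId) {
    throw new MissingTenantError();
  }
  return ctx.tenantId;
}
```

The point is that no code path can reach a query with an empty tenant context — the request dies loudly instead of silently matching zero (or all) rows.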

Strengths: Low operational overhead, works well at any tenant count, single schema to maintain.

Weaknesses: A sufficiently privileged database user (e.g., a migration runner) can bypass RLS. Compliance frameworks like HIPAA or SOC 2 Type II may require stronger physical isolation for certain data classes.

Model 2: Schema-per-Tenant

A single database hosts multiple PostgreSQL schemas — one per tenant. The application routes each request to the correct schema by setting search_path:

SET search_path TO tenant_acme, public;
SELECT * FROM agent_sessions; -- reads from tenant_acme.agent_sessions

Schema-per-tenant offers stronger isolation guarantees than RLS alone: a misconfigured query that forgets to filter by tenant_id still only touches that tenant's schema. Migrations become more complex (schema changes must be applied to every tenant schema), but tooling like db-migrate and Flyway supports this pattern.
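Because SET search_path cannot take bind parameters, the schema name must come from a trusted registry, never from raw user input. A minimal sketch of safe routing (the registry and naming scheme are illustrative):

```typescript
// Illustrative registry mapping tenant IDs to their schema names. In
// production this would be a lookup against the tenant metadata store.
const tenantSchemas = new Map<string, string>([
  ['tenant_acme', 'tenant_acme'],
]);

function searchPathFor(tenantId: string): string {
  const schema = tenantSchemas.get(tenantId);
  // Fail closed on unknown tenants and double-check the identifier shape,
  // since the schema name is interpolated into SQL.
  if (!schema || !/^[a-z0-9_]+$/.test(schema)) {
    throw new Error(`No schema registered for tenant ${tenantId}`);
  }
  return `SET search_path TO ${schema}, public`;
}
```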

Best use case: Mid-market platforms with hundreds to low thousands of tenants, particularly in regulated industries where schema-level separation satisfies auditors more easily than RLS alone.

Model 3: Database-per-Tenant

Each tenant gets a completely separate database instance. This is the strongest isolation available — a misconfiguration in one tenant's database cannot affect another.

The operational cost is substantial: provisioning, backup, upgrade cycles, and connection pooling must all scale with tenant count. Tools like PgBouncer and RDS instance management become critical infrastructure.

Best use case: Enterprise tiers, healthcare platforms, financial services — any context where a customer's contract or regulatory obligation demands physical data separation. This is often sold as a premium tier ("dedicated infrastructure") rather than the default.

The Hybrid Reality

In practice, most platforms implement a tiered model: shared database with RLS for the majority of tenants (startup and growth plans), schema-per-tenant for mid-market, and database-per-tenant for enterprise. The architecture must support migrating a tenant between tiers as they upgrade, which is a non-trivial engineering problem worth solving before you have your first enterprise prospect.


Tenant-Scoped Authentication and Authorization

Tenant Resolution at the Edge

Tenant identity must be resolved before any business logic executes. The two common approaches are:

Subdomain-based resolution: acme.platform.example.com → tenant acme. The subdomain is looked up in a tenant registry at request ingress, before routing.

Token-embedded tenant ID: The API key or JWT token contains a tenant_id claim. The authentication middleware extracts and validates this claim first.

The subdomain approach is more robust for user-facing applications because it provides a second signal — even if the token is somehow shared, the subdomain context narrows the scope. For machine-to-machine API traffic (the dominant pattern in agent platforms), token-embedded tenant context is more practical.

JWT Design for Multi-Tenant Platforms

A well-designed JWT for a multi-tenant API platform includes:

{
  "sub": "agent_01JMXYZ",
  "tenant_id": "tenant_acme",
  "org_id": "org_marketing_team",
  "scopes": ["sessions:read", "sessions:write", "webhooks:manage"],
  "resource_prefix": "acme/",
  "iat": 1708690000,
  "exp": 1708693600,
  "jti": "01JMXYZ_unique_claim_id"
}

Key design decisions:

  • Short-lived access tokens: 15–60 minutes. A compromised access token has a bounded blast radius.
  • jti (JWT ID) for revocation: Without jti, you cannot revoke a specific token before expiry. Maintain a blocklist of revoked jti values (e.g., in Redis) and check it on each request.
  • Scope-limited tokens: Agent tokens should carry only the scopes they actually need for their workflow. An agent that only reads session data should not hold webhooks:manage scope.
  • Resource prefix scoping: For platforms where resources have hierarchical IDs, embedding a resource prefix in the token prevents the token from being used against resources outside the tenant's namespace.
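The claim checks above can be sketched as a pure function, run after signature verification (which a JWT library handles and is omitted here); the in-memory revocation set stands in for a Redis blocklist:

```typescript
// Illustrative claim shape mirroring the token example above.
interface AgentClaims {
  sub: string;
  tenant_id: string;
  scopes: string[];
  exp: number; // seconds since epoch
  jti: string;
}

// Sketch: returns true only if the token is unexpired, not revoked,
// and carries the scope required for this operation.
function checkClaims(
  claims: AgentClaims,
  revokedJtis: Set<string>,
  requiredScope: string,
  nowSeconds: number
): boolean {
  if (claims.exp <= nowSeconds) return false;               // expired
  if (revokedJtis.has(claims.jti)) return false;            // revoked via jti
  if (!claims.scopes.includes(requiredScope)) return false; // scope check
  return true;
}
```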

Refresh Token Rotation

Access tokens expire quickly; refresh tokens allow clients to obtain new access tokens without re-authenticating. Refresh token rotation is the security control that makes this safe:

  1. Client presents refresh token → receives new access token + new refresh token
  2. Old refresh token is immediately invalidated
  3. If the old refresh token is presented again (indicating the original was stolen), invalidate the entire session and alert the tenant

This pattern detects refresh token theft: if an attacker steals and uses a refresh token while the legitimate client still holds it, the next time the legitimate client tries to rotate, the platform detects the double-use and can terminate the session.
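The three steps can be sketched with an in-memory store (a real implementation persists token families server-side and alerts the tenant on revocation):

```typescript
type TokenState = 'active' | 'rotated';

// Sketch of rotation with double-use detection. Storage is a Map for
// illustration only.
class RefreshStore {
  private tokens = new Map<string, { sessionId: string; state: TokenState }>();
  private revokedSessions = new Set<string>();

  issue(token: string, sessionId: string): void {
    this.tokens.set(token, { sessionId, state: 'active' });
  }

  // Returns the session to continue, or null if rotation must be refused.
  rotate(oldToken: string, newToken: string): string | null {
    const record = this.tokens.get(oldToken);
    if (!record || this.revokedSessions.has(record.sessionId)) return null;
    if (record.state === 'rotated') {
      // Double use: this token was already rotated once, so either the
      // attacker or the legitimate client holds a stolen copy. Kill the
      // whole session.
      this.revokedSessions.add(record.sessionId);
      return null;
    }
    record.state = 'rotated';
    this.issue(newToken, record.sessionId);
    return record.sessionId;
  }
}
```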

RBAC at the Tenant Level

Authorization models for multi-tenant platforms need to be scoped to tenant boundaries. The standard approach is tenant-scoped RBAC:

tenant_id: acme
  role: admin → all permissions
  role: developer → read + write, no billing
  role: agent → specific tool scopes only
  role: viewer → read-only

Role assignments are always tenant-scoped — a user who is an admin in tenant acme has no authority in tenant betacorp. This seems obvious but is a common source of bugs when RBAC is implemented as a global system and tenant scoping is bolted on later.
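A minimal sketch of tenant-scoped lookups — the role names mirror the table above, while the permission strings and storage are illustrative:

```typescript
// Illustrative permission map; role names follow the tiers described above.
const rolePermissions: Record<string, string[]> = {
  admin: ['*'],
  developer: ['read', 'write'],
  agent: ['tool:invoke'],
  viewer: ['read'],
};

// Assignments are keyed by (tenant, actor) so authority never crosses tenants.
const assignments = new Map<string, string>(); // "tenantId:actorId" -> role

function can(tenantId: string, actorId: string, permission: string): boolean {
  const role = assignments.get(`${tenantId}:${actorId}`);
  if (!role) return false; // no membership in this tenant at all
  const perms = rolePermissions[role] ?? [];
  return perms.includes('*') || perms.includes(permission);
}
```

Because the lookup key includes the tenant, an admin in acme simply has no record under betacorp — the isolation falls out of the data model rather than relying on a check someone might forget.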

For AI agent platforms specifically, agents themselves need their own identity and role in the authorization model. An agent has a machine identity (API key or service token), a tenant membership, and a set of allowed tools/scopes. The principle of least privilege applies just as strictly to agents as to human users — arguably more so, since agents can operate far faster and at higher volume than humans.


API Key Lifecycle Management

API keys are the primary authentication mechanism for machine-to-machine communication in B2B platforms. Managing them well is a significant security surface.

Key Generation

API keys should be generated with sufficient entropy (at minimum 256 bits from a cryptographically secure random number generator) and encoded in a format that is both URL-safe and human-recognizable. A common pattern:

prefix_base64url_random_checksum
sk_live_4xKqP9mN2rVwY8dT5aHjL...

The prefix (sk_live_) encodes the key type and environment, making it easy to identify in logs and to write scanners that detect accidental key exposure in source code repositories.

Store a hash of the key, not the key itself. When a request arrives with an API key, hash it with BLAKE2b or SHA-256 and compare against stored hashes. The original key is only shown once — at creation — and never stored in plaintext.
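A sketch of the generate-then-hash flow using Node's crypto module; the checksum segment from the format above is omitted for brevity:

```typescript
import { createHash, randomBytes, timingSafeEqual } from 'node:crypto';

// Sketch: generate a prefixed, URL-safe key with 256 bits of entropy and
// return only its hash for storage. The plaintext key is shown to the
// caller once and never persisted.
function hashApiKey(key: string): string {
  return createHash('sha256').update(key).digest('hex');
}

function generateApiKey(): { key: string; keyHash: string } {
  const key = 'sk_live_' + randomBytes(32).toString('base64url'); // 256 bits
  return { key, keyHash: hashApiKey(key) };
}

// On each request: hash the presented key and compare in constant time.
function verifyApiKey(presented: string, storedHash: string): boolean {
  const a = Buffer.from(hashApiKey(presented), 'hex');
  const b = Buffer.from(storedHash, 'hex');
  return a.length === b.length && timingSafeEqual(a, b);
}
```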

Key Scoping

Each API key should carry an explicit scope declaration:

{
  "key_id": "key_01JMXYZ",
  "tenant_id": "tenant_acme",
  "scopes": ["sessions:write", "tools:list"],
  "allowed_ips": ["203.0.113.0/24"],
  "expires_at": "2026-12-31T00:00:00Z",
  "created_by": "agent_deployment_pipeline",
  "description": "Production agent key — inbound webhook processor"
}

Allowed IP allowlisting is a strong defense for machine-to-machine keys. If a key is expected to be used only from a specific server or CIDR range, bind it to that range. A stolen key is useless to an attacker who cannot source requests from the allowed IPs.

Rotation Without Downtime

Key rotation is operationally challenging because consumers need to update their configured key without a service interruption. The standard pattern:

  1. Issue new key (both old and new keys are valid)
  2. Consumer updates their configuration to use the new key
  3. After a grace period (typically 7–14 days), deactivate the old key
  4. Notify the tenant if the old key is still being used as the deadline approaches

Platforms that force a hard cutover without a grace period routinely cause production outages for their tenants. The grace period is not optional for enterprise-grade platforms.

Rotation Scheduling

For high-security deployments, mandate key rotation on a schedule:

  • Standard tier: rotate every 90 days
  • Enterprise/regulated: rotate every 30 days
  • Incident response: immediate rotation on suspected compromise

Automate rotation reminders and, where possible, automate the rotation itself via a rotation API that consumers can call programmatically. HashiCorp Vault's dynamic secrets model (where keys are generated per-request and expire automatically) represents the leading edge of this: no long-lived credentials exist to steal.


SSRF Prevention in Webhook Systems

Webhook delivery is one of the most dangerous features a platform can offer. You are accepting a URL from a potentially adversarial tenant and making an outbound HTTP request to it from your infrastructure. Done naively, this is a direct path to Server-Side Request Forgery (SSRF).

The SSRF Threat Model

An attacker registers as a tenant and configures a webhook URL of http://169.254.169.254/latest/meta-data/iam/security-credentials/ — the AWS EC2 Instance Metadata Service endpoint. If your webhook delivery worker makes this request from within your cloud VPC, the attacker receives your instance's IAM credentials and can pivot to your entire AWS account.

Variations include:

  • Targeting internal services: http://internal-db-proxy.internal/admin
  • Targeting other tenants' internal endpoints via private network routing
  • Targeting your own platform's admin APIs: http://platform-api.internal/admin/users

Defense Layer 1: URL Validation Before Request

Before making any outbound webhook request, validate the URL against a blocklist:

import { promises as dns } from 'node:dns';

async function validateWebhookUrl(url: string): Promise<void> {
  const parsed = new URL(url);

  // Require HTTPS
  if (parsed.protocol !== 'https:') {
    throw new Error('Webhook URLs must use HTTPS');
  }

  // Resolve the hostname to IP addresses (production code should also
  // resolve AAAA records and validate the IPv6 results)
  const addresses = await dns.resolve4(parsed.hostname);

  for (const ip of addresses) {
    if (isPrivateOrLoopback(ip)) {
      throw new Error(`Resolved IP ${ip} is in a private range`);
    }
  }

  // Block your own platform domains
  if (isOwnDomain(parsed.hostname)) {
    throw new Error('Webhook cannot target platform-owned domains');
  }
}

function isPrivateOrLoopback(ip: string): boolean {
  const ranges = [
    '127.0.0.0/8',    // loopback
    '10.0.0.0/8',     // private
    '172.16.0.0/12',  // private
    '192.168.0.0/16', // private
    '169.254.0.0/16', // link-local (cloud metadata)
    '100.64.0.0/10',  // shared address space
    '::1/128',        // IPv6 loopback
    'fc00::/7',       // IPv6 private
  ];
  // Use a proper CIDR matching library
  return ranges.some(range => cidrContains(range, ip));
}
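For illustration, an IPv4-only cidrContains could look like the following; production code should use a vetted library that also handles IPv6 and unusual IP encodings (octal, hex, and integer forms are classic SSRF filter bypasses):

```typescript
// Sketch: IPv4-only CIDR membership test. Hypothetical helper, not a
// substitute for a hardened library.
function ipv4ToInt(ip: string): number {
  const parts = ip.split('.').map(Number);
  if (parts.length !== 4 || parts.some(p => Number.isNaN(p) || p < 0 || p > 255)) {
    throw new Error(`Invalid IPv4 address: ${ip}`);
  }
  return ((parts[0] << 24) | (parts[1] << 16) | (parts[2] << 8) | parts[3]) >>> 0;
}

function cidrContains(range: string, ip: string): boolean {
  const [base, bitsStr] = range.split('/');
  const bits = Number(bitsStr);
  // Build the network mask, then compare the masked address and base.
  const mask = bits === 0 ? 0 : (~0 << (32 - bits)) >>> 0;
  return (ipv4ToInt(ip) & mask) === (ipv4ToInt(base) & mask);
}
```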

Critical: validate the resolved IP, not just the hostname. An attacker can create a DNS record that initially resolves to a public IP (passing validation) and then change it to point to 169.254.169.254 (DNS rebinding). The only safe approach is to resolve the hostname, validate the IP, and then connect directly to that IP in the same operation — not re-resolve at connection time.
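One way to pin the connection in Node is to override DNS lookup with the already-validated IP, so the HTTP client never re-resolves. The helper below is a sketch (the callback signature follows Node's dns.lookup convention, simplified):

```typescript
// Sketch: returns a lookup function that always hands back the IP we already
// validated, never re-resolving — closing the DNS-rebinding window.
function pinnedLookup(validatedIp: string) {
  return (
    _hostname: string,
    _options: unknown,
    callback: (err: Error | null, address: string, family: number) => void
  ) => {
    callback(null, validatedIp, validatedIp.includes(':') ? 6 : 4);
  };
}
```

Passing the returned function as the `lookup` option to `https.request` keeps the Host header and TLS SNI on the original hostname while connecting only to the pinned IP.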

Defense Layer 2: Network Isolation for Webhook Workers

Place webhook delivery workers in a separate subnet with no route to your internal services:

Internet → Load Balancer → API workers (full internal access)
                        ↘ Webhook workers (outbound-only, no internal routes)

The webhook worker subnet should have:

  • No route to the metadata service address (169.254.169.254)
  • No route to internal VPC CIDRs
  • Outbound HTTPS to the internet only (port 443)
  • An egress proxy that enforces the IP blocklist at the network level

This defense-in-depth means even if the application-layer URL validation is bypassed, the network cannot route the request to internal services.

Defense Layer 3: Payload Signing

Webhooks should be signed so recipients can verify authenticity. The standard approach uses HMAC-SHA256:

import crypto from 'node:crypto';

function signWebhookPayload(
  payload: string,
  secret: string,
  timestamp: number
): string {
  const message = `${timestamp}.${payload}`;
  return crypto
    .createHmac('sha256', secret)
    .update(message)
    .digest('hex');
}

// Deliver with signature headers
headers['Webhook-Timestamp'] = String(timestamp);
headers['Webhook-Signature'] = `v1=${signature}`;

Recipients verify by recomputing the signature and comparing, and by checking that the timestamp is within an acceptable window (e.g., ±5 minutes) to prevent replay attacks.
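The recipient-side check can be sketched as follows, matching the `${timestamp}.${payload}` message format and `v1=` signature prefix above; the ±5-minute tolerance is a parameter:

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Sketch: recompute the HMAC, compare in constant time, and enforce the
// replay window. Returns false rather than throwing on any mismatch.
function verifyWebhook(
  payload: string,
  secret: string,
  timestampHeader: string,
  signatureHeader: string, // e.g. "v1=<hex>"
  nowSeconds: number,
  toleranceSeconds = 300
): boolean {
  const timestamp = Number(timestampHeader);
  if (!Number.isFinite(timestamp)) return false;
  if (Math.abs(nowSeconds - timestamp) > toleranceSeconds) return false; // replay window
  const expected = createHmac('sha256', secret)
    .update(`${timestamp}.${payload}`)
    .digest('hex');
  const received = signatureHeader.replace(/^v1=/, '');
  const a = Buffer.from(received, 'hex');
  const b = Buffer.from(expected, 'hex');
  return a.length === b.length && timingSafeEqual(a, b);
}
```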

Use multiple signing keys simultaneously during rotation periods to allow zero-downtime key transitions.


Unique Challenges for AI Agent Platforms

AI agent platforms introduce security challenges that go beyond what traditional SaaS architectures were designed for.

Agent Identity is Not Human Identity

In a human-operated SaaS platform, a "user" has clear identity, acts at human speed, and is accountable through social and contractual mechanisms. AI agents are different:

  • Volume: A single deployed agent can make tens of thousands of API calls per hour
  • Delegation chains: An orchestrator agent may spawn sub-agents, which spawn further sub-agents — creating deep delegation trees where accountability becomes unclear
  • Credential propagation: When an agent is delegated access on behalf of a human user, the agent's token must carry scoped-down permissions, not the full authority of the human

For agent-to-agent communication (the core of B2B protocol platforms), each hop in a delegation chain should reduce rather than preserve or amplify authority. The principle: an agent cannot grant another agent more permissions than it itself holds.

The Confused Deputy Problem in Agent Systems

The "confused deputy" attack is particularly sharp in agent platforms. If agent A holds admin credentials and calls agent B with a task, and agent B executes that task using A's credentials, then an attacker who can control B's task inputs effectively has admin access.

Mitigation: agents operate with their own narrowly-scoped credentials, not borrowed credentials from their callers. When an orchestrator delegates to a sub-agent, it provides a capability token scoped specifically for that sub-task, not its own master credentials.
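The attenuation rule reduces, in code, to a set intersection — a sketch:

```typescript
// Sketch: a delegated token's scopes are the intersection of what the parent
// holds and what the sub-task requests — never a superset.
function attenuateScopes(parentScopes: string[], requested: string[]): string[] {
  const held = new Set(parentScopes);
  return requested.filter(scope => held.has(scope));
}
```

A request for webhooks:manage from a parent that only holds session scopes simply drops out of the delegated token, so a compromised sub-agent cannot escalate past its caller.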

Rate Limiting Must Be Tenant-Aware

Standard rate limiting applied globally is insufficient for multi-tenant platforms. If tenant A runs a high-volume agent workload and exhausts rate limits, tenant B's legitimate traffic is blocked — a form of inadvertent denial of service.

Implement rate limiting in multiple dimensions:

Global limit: 100,000 req/min
Tenant limit: per-tenant budget (scales with plan tier)
Agent limit: per-agent-identity sub-limit within the tenant budget
Endpoint limit: per-operation limits for expensive operations

Tenant limits should be isolated from each other. A tenant exceeding their limit gets throttled; other tenants are unaffected.
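A per-tenant token bucket is one way to get this isolation: each tenant draws from its own bucket, so exhausting one budget cannot throttle another. A sketch (capacity and refill rate would scale with plan tier):

```typescript
// Sketch: independent token buckets keyed by tenant ID.
class TenantRateLimiter {
  private buckets = new Map<string, { tokens: number; lastRefillMs: number }>();

  constructor(
    private capacity: number,        // max burst per tenant
    private refillPerSecond: number  // sustained rate per tenant
  ) {}

  allow(tenantId: string, nowMs: number): boolean {
    let bucket = this.buckets.get(tenantId);
    if (!bucket) {
      bucket = { tokens: this.capacity, lastRefillMs: nowMs };
      this.buckets.set(tenantId, bucket);
    }
    // Refill proportionally to elapsed time, capped at capacity.
    const elapsed = (nowMs - bucket.lastRefillMs) / 1000;
    bucket.tokens = Math.min(this.capacity, bucket.tokens + elapsed * this.refillPerSecond);
    bucket.lastRefillMs = nowMs;
    if (bucket.tokens < 1) return false; // throttled — other tenants unaffected
    bucket.tokens -= 1;
    return true;
  }
}
```

Agent-level and endpoint-level limits layer the same mechanism with different keys (tenant+agent, tenant+operation) inside the tenant budget.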

Prompt Injection as a Cross-Tenant Attack Vector

In AI agent platforms, tenants can configure prompts, tool definitions, and workflow instructions that the agent engine executes. A malicious tenant could craft a tool description or system prompt designed to exfiltrate data from the agent's context — including data about other tenants if the agent has access to shared context.

Defenses:

  • Strict content sandboxing: agent execution contexts should be fully isolated per tenant
  • Input sanitization for all tenant-supplied content that enters the agent's context
  • Principle of minimal context: agents should only receive data they need for the current task, not a broad view of the platform's state
  • Audit logging of all agent-executed tool calls for forensic review

Compliance and Audit Logging

Multi-tenant security is not only a technical problem — it is also an audit and compliance problem. Enterprise B2B customers will ask for evidence that their data is isolated, and regulatory frameworks (GDPR, SOC 2, HIPAA) require demonstrable controls.

What to Log

Every security-relevant event should be logged with tenant context:

interface AuditEvent {
  event_id: string;
  timestamp: string;
  tenant_id: string;
  actor_id: string;        // user or agent identity
  actor_type: 'human' | 'agent' | 'system';
  action: string;          // e.g. "session.create", "webhook.deliver"
  resource_id: string;
  outcome: 'success' | 'failure' | 'blocked';
  ip_address?: string;
  metadata: Record<string, unknown>;
}

Logs must be:

  • Tamper-evident: written to append-only storage (S3 with Object Lock, WORM-enabled storage)
  • Tenant-isolated: tenants should be able to export their own audit logs without seeing other tenants' logs
  • Retained appropriately: SOC 2 typically requires 1 year; some regulations require longer

Data Residency

Enterprise customers in the EU increasingly require that their data (including logs and audit trails) not leave the EU. This requires geography-aware tenant routing: at tenant onboarding, record the data region, and route all subsequent storage and processing for that tenant to the correct region.

This is a significant architectural constraint to design for early — retrofitting data residency into an existing global platform is extraordinarily expensive.


Implementation Priorities for B2B Protocol Platforms

For a platform at the BotsHub GA stage — a multi-agent B2B protocol layer serving multiple organizations — the security work can be prioritized in phases:

Phase 1 (must have before public launch):

  • Row-level security on all tenant data tables
  • Tenant context propagated through the entire request lifecycle
  • Scoped API keys with HMAC storage (no plaintext keys)
  • HTTPS-only webhook delivery with IP blocklist validation
  • Webhook payload signing

Phase 2 (enterprise readiness):

  • JWT with short expiry + refresh token rotation
  • Per-tenant rate limiting
  • Comprehensive audit logging with tamper-evident storage
  • Schema-per-tenant option for regulated customers
  • Key rotation API with grace-period support

Phase 3 (scale and compliance):

  • Database-per-tenant option for the highest-tier enterprise contracts
  • Data residency controls with geo-aware routing
  • Formal SOC 2 audit
  • Automated adversarial testing for cross-tenant leakage

The sequencing matters: Phase 1 prevents the catastrophic failures (data leaks, SSRF attacks) that would be existential. Phase 2 unlocks the enterprise segment. Phase 3 is the certification path that enterprise procurement teams require.


Summary

Multi-tenant security for SaaS and AI agent platforms is a layered discipline. No single control is sufficient; the goal is defense-in-depth where the failure of any one layer does not result in a complete breach.

The patterns covered in this article — RLS and schema isolation, tenant-scoped JWTs, API key hashing and scoping, refresh token rotation, SSRF-hardened webhook delivery, and agent-specific authorization controls — form the core security architecture that every platform in this space needs to get right.

For AI agent platforms specifically, the speed and autonomy of agents makes the stakes higher than in traditional SaaS: a misconfigured authorization boundary can be exploited at machine speed before any human reviewer can intervene. Building the isolation in from the start, and testing it adversarially before launch, is the only reliable approach.

