Zylos
2026-05-23

BFF Architecture Patterns for Multi-Service AI Agent Platforms

Tags: architecture, bff, microservices, grpc, ai-agents

Executive Summary

The Backend-for-Frontend (BFF) pattern — popularized by Sam Newman and adopted at scale by Netflix, SoundCloud, and Spotify — is experiencing a second wave of relevance as AI agent platforms mature into distributed systems with multiple specialized microservices. Where early AI platforms shipped as monoliths, modern agent infrastructure increasingly decomposes into discrete services: an agent-orchestration core, a memory/retrieval layer, a tool-execution runtime, a billing engine, a notification dispatcher, and a real-time event bus. Each of these services has its own schema, protocol, and latency profile. The UI — whether a web dashboard, a mobile client, or a developer-facing API — needs to speak to all of them coherently.

A generic API Gateway handles infrastructure concerns at the edge but is client-agnostic. A BFF goes further: it is owned by the frontend team, knows exactly which client it serves, and is responsible for composing, transforming, and securing the surface it exposes. For AI agent platforms, this distinction is not academic — it determines where aggregation logic lives, how auth flows through a gRPC service mesh, and what happens when the memory service goes down mid-conversation.

This article covers six dimensions: BFF vs API Gateway decision criteria, aggregation patterns over gRPC/ConnectRPC, authentication placement and principal injection, graceful degradation strategies, fan-out performance and caching, and real-world architecture patterns from companies operating at scale.


1. BFF vs API Gateway — When Each Is Appropriate

The API Gateway Role

An API Gateway operates at the network edge and handles cross-cutting infrastructure concerns: TLS termination, rate limiting, JWT signature verification, request routing, and global observability (access logs, distributed trace injection). Critically, the Gateway is client-agnostic — it does not know or care whether the caller is a React dashboard, a mobile app, or an automated CLI script. It routes traffic; it does not shape payloads.

For an AI agent platform, a Gateway is the right tool for:

  • Token budget enforcement — rejecting calls that exceed per-tenant rate limits before any downstream service is touched
  • Edge authentication — verifying that a Bearer token is cryptographically valid without knowing anything about the user's workspace permissions
  • Multi-region routing — directing traffic to the nearest cluster based on latency or data-residency constraints
  • DDoS mitigation — absorbing traffic spikes before they propagate inward

The BFF Role

A BFF is a client-owned service that sits behind the Gateway and handles application-level composition. It knows the shape of the UI it serves and encodes that knowledge in code. For an AI agent platform:

  • The web dashboard BFF might call the agent-orchestration service, the usage-billing service, and the notification service in parallel, joining their results into a single DashboardSummary response.
  • A mobile BFF might request the same data but with tighter payload constraints — stripping fields that the mobile client never renders, and paginating aggressively to reduce bandwidth.
  • A developer API BFF might expose a more granular surface, passing through tool-execution results with full metadata rather than the summarized form the web UI displays.

Decision Criteria

Situation                                       | Use API Gateway | Use BFF
------------------------------------------------|-----------------|------------
Single client type                              | Yes             | Not needed
Multiple clients with divergent data needs      | Not sufficient  | Yes
Cross-cutting concerns (rate limiting, TLS)     | Yes             | No
Client-specific aggregation and transformation  | No              | Yes
Client teams own the backend contract           | No              | Yes
Protocol translation (REST → gRPC internally)   | Partial         | Yes

The canonical deployment topology pairs them: the Gateway sits at the edge handling infrastructure, then routes to the appropriate BFF, which handles application composition. They are not alternatives — they are layers.

The Monolithic BFF Anti-Pattern

A common mistake is building a single BFF for all clients. This recreates the original problem: the BFF bloats with conditional logic to serve incompatible client requirements, its deployment becomes high-risk, and ownership becomes diffuse. The rule: one BFF per client type (web, mobile, third-party developer API). If two clients share more than ~80% of their surface, a shared BFF may be justified — but the threshold is high.


2. BFF Aggregation Patterns Over gRPC and ConnectRPC

Modern AI agent platforms increasingly run their internal service mesh over gRPC or ConnectRPC. gRPC provides strongly-typed contracts via Protocol Buffers, efficient binary framing, multiplexed HTTP/2 connections, and native support for bidirectional streaming. ConnectRPC builds on this with superior browser compatibility and supports the gRPC, gRPC-Web, and Connect protocols on a single port.

The BFF is where REST or GraphQL (spoken externally) meets gRPC (spoken internally). This section describes three aggregation patterns.

Pattern 1: Parallel Fan-Out and Join

The most common pattern: the BFF issues multiple gRPC calls concurrently and merges the results before responding to the client.

Client ──HTTP──▶ BFF
                  │
          ┌───────┼───────┐
          ▼       ▼       ▼
      AgentSvc  BillingSvc  NotifySvc
       (gRPC)    (gRPC)    (gRPC)
          │       │         │
          └───────┴─────────┘
                  │
          merge + transform
                  │
          ◀── HTTP Response

In Go, this is a sync.WaitGroup or errgroup pattern:

g, ctx := errgroup.WithContext(ctx)

var agentData  *agentpb.AgentSummary
var billingData *billingpb.UsageSummary

g.Go(func() error {
    var err error
    agentData, err = agentClient.GetSummary(ctx, &agentpb.GetSummaryReq{WorkspaceId: wsID})
    return err
})
g.Go(func() error {
    var err error
    billingData, err = billingClient.GetUsage(ctx, &billingpb.GetUsageReq{WorkspaceId: wsID})
    return err
})

if err := g.Wait(); err != nil {
    return nil, err
}
return merge(agentData, billingData), nil

In TypeScript/Node.js with ConnectRPC, the equivalent uses Promise.all — but with a critical caveat: unconstrained parallelism can saturate connection pools. Libraries like p-limit enforce a maximum concurrency ceiling (commonly 8–12 concurrent gRPC calls) to avoid melting the downstream cluster under bursty load.

Pattern 2: Sequential Pipeline with Enrichment

Some aggregations are inherently sequential — the output of one call determines the input to the next. A common AI platform pattern:

  1. Call the Session service to retrieve the active conversation context
  2. Use the returned session_id to call the Memory service for relevant retrieved facts
  3. Use the returned memory fragments to call the Agent service for response generation

Each step enriches the request context for the next. The BFF owns this pipeline and can short-circuit early if any step fails (see Section 4).

Pattern 3: gRPC Federation (Declarative Aggregation)

For platforms with many services and complex join logic, declarative aggregation tools like Mercari's grpc-federation allow expressing composition rules directly in Protocol Buffer options. The tool auto-generates a BFF server from these definitions. This approach trades flexibility for consistency — the composition graph is type-checked at compile time, and changes to upstream service schemas break the BFF build rather than causing silent runtime failures.

This pattern is appropriate when the aggregation graph is stable, the team is comfortable with code generation, and the number of upstream services exceeds ~6–8, at which point hand-written fan-out logic becomes difficult to audit.


3. Authentication Placement — BFF as the Auth Boundary

Why Auth Belongs at the BFF, Not the Gateway

The Gateway performs token verification — confirming that a JWT is signed correctly and not expired. This is a cryptographic operation, fast and stateless. But token verification is not the same as authorization. Knowing that a token is valid tells you nothing about whether user_id: 42 has permission to read workspace_id: 99.

Authorization — determining what a verified identity is allowed to do — requires business logic. That logic belongs at the BFF layer, not at the Gateway (which should remain infrastructure-level) and not scattered across every downstream service (which creates duplicated, inconsistent enforcement).

The BFF becomes the auth boundary: it receives the validated token identity from the Gateway (typically via a forwarded header), resolves it against a permissions model, and constructs a Principal — a typed struct encoding the authenticated identity and their effective permissions.

Injecting Principal via gRPC Metadata

Once the BFF has resolved the Principal, it must propagate it to every downstream gRPC service without requiring each service to re-derive it from the original token. The mechanism is gRPC metadata — key-value pairs attached to the gRPC call context, analogous to HTTP headers.

A common convention:

// BFF resolves and serializes the Principal
principal := &authpb.Principal{
    UserId:      claims.Sub,
    WorkspaceId: resolvedWorkspace,
    Roles:       resolvedRoles,
}
serialized, err := proto.Marshal(principal)
if err != nil {
    return nil, fmt.Errorf("marshal principal: %w", err)
}

// Attach to outgoing gRPC metadata. Keys ending in "-bin" carry binary
// values, which grpc-go base64-encodes on the wire. AppendToOutgoingContext
// keeps metadata already on ctx; NewOutgoingContext would replace it.
ctx = metadata.AppendToOutgoingContext(ctx, "x-principal-bin", string(serialized))

// All downstream gRPC calls on this ctx carry the Principal
agentClient.Execute(ctx, req)

Downstream services extract the Principal from incoming metadata and trust it — they do not independently verify the original JWT. This creates a trust boundary: services inside the mesh trust the BFF to have done auth correctly; services outside the mesh (reached directly) would need their own verification.

The Token Leakage Problem

A key security motivation for the BFF pattern is preventing long-lived access tokens from ever reaching the browser. The BFF maintains session state (e.g., HttpOnly cookies), exchanges short-lived tokens with downstream services, and never exposes raw credentials to the client. The browser holds only an opaque session identifier; the BFF holds the real credentials. This dramatically reduces the blast radius of XSS attacks.


4. Graceful Degradation When Downstream Services Are Unavailable

The Failure Taxonomy

In a fan-out BFF calling 4–6 downstream services, failures are not binary. The relevant categories are:

  • Critical service unavailable: The call cannot proceed at all (e.g., the agent-orchestration service is down and the user is trying to run an agent)
  • Supplementary service unavailable: The core response can still be returned with degraded richness (e.g., the notification-count service is down; return the dashboard without the unread count)
  • Transient error: A service returned a 503 or a gRPC UNAVAILABLE status that may resolve on retry
  • Timeout: A service is responding but slowly, threatening the overall response latency budget

Circuit Breaker Integration

The circuit breaker pattern prevents a slow or failing downstream service from consuming all BFF goroutines/threads. It operates in three states:

  1. Closed (normal operation): requests flow through; failure metrics are collected
  2. Open (failure threshold exceeded): requests to the failing service are fast-rejected immediately, returning a fallback without attempting the real call
  3. Half-Open (recovery probe): a small fraction of requests are allowed through to test whether the service has recovered

For AI agent platforms, the circuit breaker threshold configuration matters: a memory retrieval service that degrades under load should open the circuit after a lower error rate (e.g., 5% failures over 10 seconds) than a non-critical notification service.

Libraries: Resilience4j (Java), Hystrix (Java, now in maintenance mode), go-resiliency (Go), cockatiel (TypeScript).

The Partial Response Pattern

For non-critical services, the BFF should return a partial response rather than an error. The client receives the data it needs for core functionality, with absent sections clearly marked:

{
  "agent": { "status": "idle", "lastRun": "2026-05-23T10:00:00Z" },
  "usage": null,
  "_degraded": ["usage"],
  "_degradedReason": "billing service unavailable"
}

The client renders what it has and can display a non-blocking warning. This is strictly preferable to returning a 503 for the entire page load because one optional service failed.

Stale-While-Revalidate Caching as Fallback

For read-heavy operations, the BFF can serve a previously cached response when a downstream service is temporarily unavailable. Two related behaviors are involved: stale-while-revalidate serves the cached value immediately while asynchronously attempting to refresh it, and stale-if-error keeps serving the cached value when that refresh fails. For AI agent dashboards, data like usage totals, agent configuration, and billing summaries tolerate seconds or even minutes of staleness, so the cache becomes the reliability safety net.


5. Performance: Fan-Out Latency, Caching, and Avoiding the BFF Bottleneck

The Fan-Out Latency Problem

The BFF's response time in a parallel fan-out is determined by the slowest call. With four services at p50 latencies of 10ms, 15ms, 8ms, and 20ms, the BFF p50 is at least ~20ms, because the response waits for all four. At the p99, where tail latency matters, a single slow service can drag the entire response to 500ms+. This is the tail latency amplification problem: the more services you fan out to, the worse your p99 gets.

Mitigation strategies:

Aggressive timeouts per call, not per total response: Each downstream gRPC call should have its own context deadline shorter than the overall BFF deadline. A 200ms overall deadline might translate to 80ms per downstream call with retries, ensuring no single call occupies the full budget.

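Hedged requests: For latency-critical paths, issue a duplicate request to the same service after a short delay (e.g., 50ms) and take whichever response arrives first. The tail of the latency distribution collapses at the cost of extra load on downstream services. How much depends on the hedge delay: a hedge that fires only after roughly the p95 latency duplicates about 5% of requests, while a shorter delay duplicates far more. Use judiciously, and only for idempotent calls.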

Selective fan-out: Not every UI panel needs to load in the critical path. The BFF can return a fast first response for above-the-fold data and leave slower supplementary services to a deferred call or a streaming update.

Caching Architecture at the BFF Layer

The BFF sits at the ideal point for caching: it knows exactly which client is requesting data and can construct cache keys that incorporate client identity, workspace, and API version.

A layered caching strategy:

  1. In-process cache (L1): Sub-millisecond access, limited size. Good for configuration data, user profile info, and permission lookups that are read thousands of times per minute and change rarely. TTL: 30–60 seconds.

  2. Distributed cache (L2): Redis or Memcached. Shared across BFF instances. Good for aggregated view data that multiple users in the same workspace might request. TTL: 5–30 seconds depending on freshness requirements.

  3. Response-level cache: Cache the fully composed BFF response for idempotent read operations, keyed by the request hash. This is high-risk for personalized data but effective for organization-level dashboards.

The cache invalidation strategy for AI platforms is nuanced: agent state changes frequently and must not be served stale, but historical usage data and billing summaries are good candidates for aggressive caching.

Avoiding the BFF Bottleneck

As the BFF aggregates more services, it risks becoming a bottleneck itself — a service that all traffic flows through and that, if it goes down, takes the entire frontend with it.

Mitigation:

  • Stateless BFF instances: All session state lives in a distributed store (Redis). BFF instances are interchangeable and horizontally scalable.
  • Per-client BFF isolation: Different client BFFs (web, mobile, developer API) have separate deployments. A bug or overload in the mobile BFF does not affect the web dashboard.
  • Avoid synchronous blocking on the BFF: Long-running operations (e.g., waiting for an agent to finish executing a multi-step tool chain) should use asynchronous patterns — the BFF returns a job ID and the client polls or subscribes via WebSocket/SSE, rather than the BFF holding an HTTP connection open.

6. Real-World Patterns from Multi-Service Architectures

Netflix: Per-Platform BFF at Scale

Netflix runs distinct BFF layers for each device category: TV (10-foot experience, pre-fetching large thumbnails, offline caching), mobile (touch-optimized, smaller payloads, adaptive streaming quality), and web (rich metadata, browsing-first interactions). Each BFF is owned by the corresponding device team and deploys independently. The Gateway handles edge routing and token validation; BFFs handle composition and device-specific transformation. Netflix has written extensively about their Falcor/GraphQL-based aggregation layers, which were essentially BFF implementations before the pattern had a canonical name.

SoundCloud: BFF as the Migration Path from Monolith

SoundCloud's migration story is instructive for AI platforms at an earlier stage. They began with a Rails monolith serving all clients. As the monolith became difficult to change, they introduced BFFs as a seam: the BFF translated client requests into calls to the monolith initially, then incrementally rerouted individual calls to new microservices as those were carved out. The BFF absorbed the complexity of the transition, shielding clients from the internal restructuring. For AI platforms mid-migration from a monolithic core to microservices, this pattern is directly applicable.

Spotify: GraphQL BFF for Composition Flexibility

Spotify has publicly described using GraphQL at their BFF layer, with downstream services still using REST and gRPC internally. GraphQL's schema-stitching capability maps well to the BFF's aggregation role: the client specifies exactly the fields it needs, the BFF resolves those fields from whichever downstream services own them, and only the necessary data is transferred. For AI agent platforms with highly variable UI panels (some showing agent-execution timelines, others showing cost breakdowns, others showing memory graphs), GraphQL's client-driven query model can reduce the number of bespoke BFF endpoints that need to be maintained.

The AI Agent Platform Pattern: BFF as the Orchestration Boundary

Emerging AI agent platforms add a dimension that traditional BFF literature does not fully address: the BFF is not only aggregating static microservices but is coordinating with non-deterministic execution flows. An agent might take 2 seconds or 120 seconds; its output might trigger downstream tool calls whose completion must be surfaced back to the UI.

The pattern that works well here:

  1. Synchronous BFF for initiation: The client calls the BFF to start an agent run; the BFF calls the orchestration service, gets a run_id, and returns immediately. No long-held connections.
  2. Event stream for progress: The BFF exposes a Server-Sent Events (SSE) or WebSocket endpoint that the client subscribes to. The BFF bridges an internal event bus (e.g., NATS, Kafka) to the client's connection, streaming execution events as they arrive.
  3. Polling fallback: Clients that cannot maintain persistent connections poll a BFF endpoint that caches the latest run state, populated by the same internal event stream.
  4. gRPC streaming internally: Between the BFF and the orchestration service, a gRPC server-streaming RPC delivers execution events efficiently without polling overhead.

This hybrid pattern — synchronous for write operations, event-driven for state propagation — preserves the BFF's role as a clean abstraction boundary while avoiding the pitfall of blocking on slow AI inference.
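
The event-stream step can be sketched with net/http alone. Here a Go channel stands in for the subscription to the internal event bus; RunEvent, FormatEvent, and StreamRun are hypothetical names, and Data is assumed to be single-line JSON.

```go
package main

import (
	"fmt"
	"net/http"
)

// RunEvent is a hypothetical execution event delivered by the internal bus.
type RunEvent struct {
	Type string
	Data string // assumed to contain no newlines
}

// FormatEvent renders one event in the Server-Sent Events wire format.
func FormatEvent(ev RunEvent) string {
	return fmt.Sprintf("event: %s\ndata: %s\n\n", ev.Type, ev.Data)
}

// StreamRun forwards events to the client until the channel is closed or the
// client disconnects.
func StreamRun(events <-chan RunEvent) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		flusher, ok := w.(http.Flusher)
		if !ok {
			http.Error(w, "streaming unsupported", http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "text/event-stream")
		w.Header().Set("Cache-Control", "no-cache")

		for {
			select {
			case <-r.Context().Done():
				return // client went away
			case ev, open := <-events:
				if !open {
					return // run finished
				}
				fmt.Fprint(w, FormatEvent(ev))
				flusher.Flush() // push the event instead of buffering it
			}
		}
	}
}
```

In a real BFF the handler would look up the run_id from the request, check the caller's permissions, and subscribe to that run's events before streaming.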

Auth Pattern in Production: Zero-Trust Inside the Mesh

Several mature AI platforms have moved to a zero-trust internal mesh where every gRPC service validates the Principal injected by the BFF rather than trusting network-level controls alone. The BFF serializes the Principal into gRPC metadata; downstream services deserialize and verify it against a shared signing key or a lightweight internal auth service. This prevents compromised internal services from escalating privileges by constructing arbitrary Principals.

The pattern requires careful key rotation procedures and adds ~0.5–2ms per call for deserialization and verification — a cost that is almost always acceptable given the security improvement.


Key Takeaways

  • Layer BFF over Gateway, not instead of it: the Gateway handles infrastructure, the BFF handles application composition. They solve different problems.
  • One BFF per client type: resist the temptation to build a universal BFF; it becomes the monolith you were trying to escape.
  • Auth boundary at the BFF: verify tokens at the Gateway edge, resolve permissions and construct the Principal at the BFF, propagate via gRPC metadata to all downstream services.
  • Fail gracefully at the service level: design every downstream call as either critical (fail the whole response) or supplementary (return partial response with _degraded markers). Never let a non-critical service take down the whole page.
  • Control fan-out concurrency: parallel gRPC calls with errgroup (Go) or Promise.all + p-limit (TypeScript) avoid both sequential latency and connection pool exhaustion.
  • Cache at the BFF, not the services: the BFF has the context to build correct cache keys (user, workspace, API version) and the placement to serve cached data without a network hop.
  • For AI agent platforms specifically: decouple initiation (synchronous BFF) from progress propagation (event stream bridging) to avoid blocking on non-deterministic execution.
