Shared AI Service Layers in Plugin and Component Architectures

Executive Summary

When a software platform adopts AI capabilities, there are two architectural choices: let each plugin or component call LLM providers independently, or provide a shared AI service layer that all components consume. The naive approach -- every component manages its own API keys, builds its own HTTP client, and handles its own rate limiting -- feels fast at first. It collapses under its own weight once the plugin ecosystem grows beyond a handful of components.

The problem is not theoretical. CVE-2025-11749 exposed a bearer token through WordPress's REST API index because a plugin stored credentials without the coordination that a platform layer would enforce. CVE-2025-53967, disclosed in 2025 in a third-party Figma MCP server, allowed command injection through unsanitized input, putting any API keys available to the server process within an attacker's reach. Real platforms have converged on a solution: a shared AI service layer that owns credentials, enforces rate limits, provides audit trails, and offers components a clean API that hides all of this complexity.

This article examines how production platforms -- VSCode, WordPress 7.0, JetBrains IDE Services -- implement shared AI layers, what agent frameworks (LangChain, CrewAI, AutoGen) teach us about shared model access, and six architectural principles that emerge from these implementations for platform engineers building component ecosystems.

Why Per-Component AI Integration Fails

Before examining solutions, it is worth being precise about why the naive approach fails. The failure modes are predictable and compound each other.

The attribution black hole. When each plugin calls a provider with its own key, every call is attributed to the key, not the plugin, the user, or the feature. A dozen plugins using separate keys produce logs that read: api-key: sk-abc123 called gpt-4, 1,247 tokens consumed. Which plugin? For which user? In response to which action? Impossible to know without metadata embedded at call time -- and only a centralized layer can enforce that metadata is always present. For high-risk AI systems, the EU AI Act (Article 19) requires providers to retain automatically generated logs for at least six months, and per-plugin credential models fragment those logs across vendor dashboards in ways auditors cannot reconcile.
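
One way a shared layer can enforce call-time metadata is to refuse any request that arrives without it. The sketch below is illustrative only: the CallAttribution schema, its field names, and the audit_record helper are assumptions, not part of any platform described here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CallAttribution:
    """Metadata the shared layer requires on every AI call (hypothetical schema)."""
    plugin_id: str
    user_id: str
    action: str

    def __post_init__(self) -> None:
        # Reject empty fields so no call can reach the provider unattributed.
        for name, value in asdict(self).items():
            if not value:
                raise ValueError(f"missing attribution field: {name}")


def audit_record(attribution: CallAttribution, model: str, tokens: int) -> str:
    """Build the structured log line the platform would persist for this call."""
    record = {**asdict(attribution), "model": model, "tokens": tokens}
    return json.dumps(record, sort_keys=True)


# Example: the log line now answers "which plugin, which user, which action".
print(audit_record(CallAttribution("summarizer", "user-42", "summarize_post"),
                   "gpt-4", 1247))
```

Because the record is built inside the shared layer, a plugin cannot omit or forge it without bypassing the layer entirely.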

Token quota cannibalization. A single plugin running batch summarization can exhaust an organization's tokens-per-minute quota while all other plugins receive HTTP 429 errors. Traditional request-per-second rate limits are structurally inadequate for LLM workloads where a single request can consume thousands of tokens. Without a platform layer that distributes quota allocation, one misbehaving plugin degrades the entire platform.
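
A minimal sketch of per-plugin token accounting follows, assuming a fixed one-minute window and an equal tokens-per-minute share for each plugin. The TokenQuota class and its parameters are hypothetical; production gateways typically use sliding windows or token buckets and weight shares by priority.

```python
import time
from collections import defaultdict
from typing import Callable


class TokenQuota:
    """Fixed-window tokens-per-minute limiter with a separate share per plugin."""

    def __init__(self, per_plugin_tpm: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.per_plugin_tpm = per_plugin_tpm
        self.clock = clock
        self.window_start: dict[str, float] = {}
        self.used: dict[str, int] = defaultdict(int)

    def try_consume(self, plugin_id: str, tokens: int) -> bool:
        """Return True and record usage if the plugin still has quota this minute."""
        now = self.clock()
        start = self.window_start.get(plugin_id)
        if start is None or now - start >= 60:
            # Start a fresh one-minute window for this plugin.
            self.window_start[plugin_id] = now
            self.used[plugin_id] = 0
        if self.used[plugin_id] + tokens > self.per_plugin_tpm:
            # Only this plugin is throttled; other plugins keep their share.
            return False
        self.used[plugin_id] += tokens
        return True
```

The limiter counts tokens rather than requests, so a single large batch request is charged for what it actually consumes.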

Credential sprawl as attack surface. Major AI SDKs keep long-lived credentials in application memory by design -- the OpenAI SDK requires an API key at client creation and maintains it for the session's lifetime. In plugin architectures where each plugin is a separate process or module, the attack surface grows in proportion to the number of plugins. CVE-2025-11749, affecting the AI Engine WordPress plugin, exposed a bearer token in plaintext through the public /wp-json/ REST API index due to a single missing parameter in route registration. Wordfence detected active exploitation within 24 hours of disclosure. This class of vulnerability is structurally eliminated by a platform that owns credentials and never exposes them to plugins.

Inconsistent behavior and the integration tax. Each plugin reimplements the same boilerplate: HTTP client configuration, retry logic, streaming response parsing, error normalization, model version pinning. When the provider changes a model's API (and they do), every plugin breaks independently. When the platform wants to swap the underlying model, it must coordinate across every plugin. The integration tax accumulates until maintaining the ecosystem becomes unsustainable.

How Real Platforms Implement Shared AI Layers

VSCode: The Manifest Declaration Pattern

VSCode's Language Model API represents the "blessed SDK" approach at its most mature. Extensions that provide AI capabilities declare them statically in their package.json manifest:

"contributes": {
  "languageModelChatProviders": [{
    "vendor": "my-provider",
    "displayName": "My Provider",
    "managementCommand": "my-provider.manage"
  }]
}

Consumer extensions access models through a platform API that abstracts the underlying provider entirely:

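import * as vscode from 'vscode';

// Runs inside a user-initiated handler that supplies cancellationToken; the prompt text is a placeholder.
const [model] = await vscode.lm.selectChatModels({ vendor: 'copilot', family: 'gpt-4o' });
if (!model) { return; } // no matching model is available
const craftedPrompt = [vscode.LanguageModelChatMessage.User('Explain this code')];
const chatResponse = await model.sendRequest(craftedPrompt, {}, cancellationToken);
for await (const fragment of chatResponse.text) {
  // handle streaming response
}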

A user consent dialog fires the first time any extension calls selectChatModels(). Extensions must invoke this as part of a user-initiated action. The provider interface (LanguageModelChatProvider) defines three methods: provideLanguageModelChatInformation() returning model descriptors (ID, name, family, token limits, capability flags), provideLanguageModelChatResponse() returning streaming response parts, and provideTokenCount().

The enforcement mechanism is architectural incentive rather than hard network blocking. Calling the OpenAI API directly from an extension is technically possible in the Node.js extension host. The positive incentive to use the platform API is strong: it handles streaming, error normalization, model switching when Copilot changes underlying models, and user-visible attribution. An extension calling the provider directly must rebuild all of this and track provider-side model changes on its own.

This pattern -- "make the right way easier than the wrong way" -- is characteristic of platforms that cannot enforce hard network restrictions on extension code.

WordPress 7.0: The Core-Bundled Library Pattern

Before WordPress 7.0 (released March 2026), hundreds of AI plugins each maintained their own provider integration and credential management. Site owners configured API keys across a dozen different settings screens with no coordination between plugins.

WordPress 7.0 introduced an AI Client in Core with a two-layer architecture. The first layer is a provider-agnostic PHP library bundled in Core, handling raw HTTP to OpenAI, Anthropic, and Google with normalized responses. The second layer is a WordPress wrapper (WP_AI_Client_Prompt_Builder) with WordPress conventions, WP_Error returns, and filter hooks.

The fluent builder API deliberately reads like WordPress code:

wp_ai_client_prompt('Summarize this post')
    ->using_temperature(0.7)
    ->using_model_preference('claude-opus-4')
    ->generate_text();

Credentials are configured once via Settings > Connectors. The documentation is explicit: "Plugin developers using the AI Client to build features do not need to handle credentials at all."

Enforcement is architectural: plugins that bypass the official API cannot participate in the Abilities API (capability detection), the Connectors UI (credential management), or benefit from the wp_ai_client_prevent_prompt filter that lets site administrators block specific AI operations. The platform makes bypassing the standard more work than using it, while providing capabilities that independently-integrated plugins cannot replicate.

JetBrains IDE Services: Hard Enforcement at the Platform Configuration Layer

JetBrains takes the strictest approach. Administrators configure provider credentials once at the platform level via centralized IDE Services configuration -- OpenAI keys, Azure API keys, Google Vertex service account JSON, Amazon Bedrock IAM roles. Individual extensions access AI capabilities through the platform's managed identity layer and never maintain their own credentials. This is hard enforcement: there is no configuration surface where an extension can specify its own provider key.

The JetBrains approach demonstrates that in enterprise contexts with administrator-controlled deployments, the platform layer can enforce credential isolation completely rather than relying on architectural incentives.

Agent Framework Patterns: Dependency Injection as the Foundation

Agent frameworks (LangChain, CrewAI, AutoGen) are plugin architectures in a different register: the "plugins" are agents, tools, and chains that compose into larger systems. How they handle shared model access reveals the underlying mechanics of the shared service layer pattern.

The key finding: no major framework uses a global singleton. All rely on dependency injection -- the developer instantiates one client and passes the same object reference to multiple components. This is not a limitation; it is the correct design. A global singleton introduces hidden coupling between components and makes testing difficult. Explicit dependency injection makes the shared resource visible in the code structure.

In LangChain, the idiomatic pattern shares one HTTP connection pool across multiple chains:

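from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel
from langchain_openai import ChatOpenAI
# humor_prompt and clarity_prompt are prompt templates defined elsewhere
llm = ChatOpenAI(model="gpt-4o-mini")  # single instance
humor_chain   = humor_prompt   | llm | StrOutputParser()
clarity_chain = clarity_prompt | llm | StrOutputParser()
parallel = RunnableParallel(humor=humor_chain, clarity=clarity_chain)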

The LangChain global cache (set through langchain_core.globals) is the one genuine singleton -- a process-level cache keyed on the pair (prompt, llm_string), where llm_string is a deterministic serialization of model parameters. The composite key prevents cross-model cache pollution, which is the failure mode that makes global caches feel dangerous in practice.
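
The composite-key idea can be illustrated without LangChain itself. The sketch below is not LangChain's implementation; llm_string and cache_key are hypothetical helpers that show why serializing the model configuration into the key keeps responses from different models apart.

```python
import hashlib
import json


def llm_string(model: str, **params) -> str:
    """Deterministic serialization of model parameters (keys are sorted)."""
    return json.dumps({"model": model, **params}, sort_keys=True)


def cache_key(prompt: str, model_config: str) -> str:
    """Hash the prompt together with the serialized model configuration."""
    payload = json.dumps([prompt, model_config])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same prompt, different models: the keys differ, so entries never collide.
cache: dict[str, str] = {}
cache[cache_key("Summarize this", llm_string("gpt-4o", temperature=0.0))] = "answer A"
cache[cache_key("Summarize this", llm_string("gpt-4o-mini", temperature=0.0))] = "answer B"
print(len(cache))
```

Sorting the parameter keys matters: two clients configured identically but in a different argument order must map to the same cache entry.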

AutoGen 0.4 made dependency injection a structural requirement. AutoGen 0.2 used loose llm_config dicts with a config_list for fallback. AutoGen 0.4 requires explicit typed model client injection:

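from autogen_agentchat.agents import AssistantAgent
from autogen_ext.models.openai import OpenAIChatCompletionClient

model_client = OpenAIChatCompletionClient(model="gpt-4o")
primary_agent = AssistantAgent("primary", model_client=model_client)
critic_agent  = AssistantAgent("critic",  model_client=model_client)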

The same client object flows to both agents. This makes the shared resource a first-class structural element rather than an implicit global.

LangGraph v1 (2025) addressed the deep nesting problem that dependency injection creates when components are many layers removed from where the shared service is configured. Their solution was typed runtime context:

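from dataclasses import dataclass
from typing import TypedDict
from langgraph.graph import StateGraph
from langgraph.runtime import Runtime

class State(TypedDict):
    text: str  # placeholder state schema for illustration

@dataclass
class AppContext:
    llm_provider: str = "openai"
    user_id: str = ""

graph = StateGraph(State, context_schema=AppContext)

def my_node(state: State, runtime: Runtime[AppContext]):
    user_id = runtime.context.user_id  # typed, IDE-friendly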

LangGraph documentation explicitly discourages storing LLM clients in graph state -- state is for serializable data; LLM clients belong in module scope or runtime context. OpenAI's Agents SDK and Pydantic AI converge on the same typed-context pattern under their own type names. This is how shared services, including AI clients, flow through deeply nested component hierarchies without global state.

Credential Propagation: How Platform Subscriptions Reach Components

The central operational question for shared AI layers is credential propagation: how does a platform-level AI subscription reach each component without each component holding raw credentials?

Virtual keys are the most concrete implementation of this pattern at scale. LiteLLM's proxy demonstrates the mechanism: each component or team receives a generated virtual key via POST /key/generate. The key carries a model allowlist, a USD budget cap, requests-per-minute and tokens-per-minute limits, and a user or team ID for spend attribution. The gateway holds real provider credentials and injects them at call time. Component authors never see raw OpenAI or Anthropic keys.
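
The mechanism can be sketched in memory. This is not LiteLLM's code: VirtualKey, Gateway, and KeyRejected are hypothetical names, the cost estimate is supplied by the caller for simplicity, and rate limits are omitted.

```python
from dataclasses import dataclass


@dataclass
class VirtualKey:
    """Scoped key issued to a component or team (hypothetical schema)."""
    team_id: str
    allowed_models: frozenset[str]
    max_budget_usd: float
    spent_usd: float = 0.0


class KeyRejected(Exception):
    """Raised when a virtual key fails the allowlist or budget check."""


class Gateway:
    def __init__(self, provider_secret: str) -> None:
        self._provider_secret = provider_secret  # held only by the gateway
        self._keys: dict[str, VirtualKey] = {}

    def register(self, token: str, key: VirtualKey) -> None:
        self._keys[token] = key

    def authorize(self, token: str, model: str, est_cost_usd: float) -> dict[str, str]:
        """Validate the virtual key, then build upstream headers with the real credential."""
        key = self._keys.get(token)
        if key is None:
            raise KeyRejected("unknown virtual key")
        if model not in key.allowed_models:
            raise KeyRejected(f"model {model!r} not in allowlist")
        if key.spent_usd + est_cost_usd > key.max_budget_usd:
            raise KeyRejected("budget exceeded")
        key.spent_usd += est_cost_usd  # attribute spend to the key's team
        return {"Authorization": f"Bearer {self._provider_secret}"}
```

The component only ever presents its virtual token; the headers returned by authorize are used by the gateway for the upstream call and are never handed back to the component.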

The gateway's middleware hook pipeline exposes lifecycle hooks at each stage of an AI call: async_pre_call_hook (before the LLM call, for request modification or early rejection), async_moderation_hook (parallel to the call, for content scanning without blocking), async_post_call_success_hook (after success, for logging and audit), async_post_call_streaming_iterator_hook (for each streaming chunk, enabling real-time token-by-token filtering), and async_post_call_failure_hook (after failure, for alerting and circuit-breaking). Budget enforcement is hierarchical: Organization -> Team -> User, with requests blocked when any level exceeds its budget.
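
The shape of such a pipeline can be sketched generically. The code below does not use LiteLLM's classes: HookPipeline and budget_hook are hypothetical, the moderation and streaming hooks are left out for brevity, and the request is assumed to be a dict carrying "org", "team", and "user" identifiers.

```python
import asyncio
from typing import Awaitable, Callable

Request = dict
Hook = Callable[[Request], Awaitable[None]]


class HookPipeline:
    """Runs pre-call, success, and failure hooks around a model call."""

    def __init__(self) -> None:
        self.pre_call: list[Hook] = []
        self.post_success: list[Hook] = []
        self.post_failure: list[Hook] = []

    async def run(self, request: Request,
                  call: Callable[[Request], Awaitable[str]]) -> str:
        try:
            for hook in self.pre_call:
                await hook(request)  # a hook rejects the request by raising
            response = await call(request)
        except Exception:
            # Design choice in this sketch: rejections also reach failure hooks.
            for hook in self.post_failure:
                await hook(request)
            raise
        for hook in self.post_success:
            await hook(request)
        return response


def budget_hook(spent: dict[str, float], limits: dict[str, float]) -> Hook:
    """Pre-call hook: block when the org, team, or user has reached its budget."""
    async def check(request: Request) -> None:
        for level in ("org", "team", "user"):
            scope = request[level]
            if spent.get(scope, 0.0) >= limits.get(scope, float("inf")):
                raise PermissionError(f"{level} budget exceeded: {scope}")
    return check
```

Registering budget_hook in pre_call gives the hierarchical behavior described above: the request is blocked as soon as any one of the three levels is over its limit, before any provider call is made.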

For OAuth-based delegation -- relevant when components operate on behalf of specific users -- the Token Exchange Flow (RFC 8693) allows an existing token for one service to be exchanged for a scoped token for another, preserving least-privilege as requests traverse component chains. A critical production detail that most documentation omits: reactive token refresh (waiting for 401 errors) creates race conditions and cascading retry storms. Production systems should renew proactively, refreshing tokens a fixed margin (for example, 5 minutes) before expiration rather than after they expire.
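
Proactive renewal can be as small as a cache that compares the clock against the expiry minus a safety margin. The sketch below assumes a caller-supplied fetch function returning a token and its absolute expiry time; ProactiveTokenCache and its parameter names are hypothetical.

```python
import threading
import time
from typing import Callable, Optional, Tuple


class ProactiveTokenCache:
    """Returns a cached token, refreshing it a fixed margin before it expires."""

    def __init__(self, fetch: Callable[[], Tuple[str, float]],
                 margin_seconds: float = 300.0,
                 clock: Callable[[], float] = time.time) -> None:
        self._fetch = fetch            # returns (token, absolute expiry timestamp)
        self._margin = margin_seconds
        self._clock = clock
        self._lock = threading.Lock()  # one refresh at a time, no thundering herd
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        with self._lock:
            due = self._expires_at - self._margin
            if self._token is None or self._clock() >= due:
                self._token, self._expires_at = self._fetch()
            return self._token
```

Callers never observe an expired token, so the 401-then-retry path that drives retry storms is not exercised in normal operation.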

Azure API Management's managed identity pattern represents the cleanest enterprise implementation: APIM's system-assigned managed identity is granted Cognitive Services OpenAI User role on the Azure OpenAI resource. The authentication-managed-identity policy obtains a short-lived access token at request time. No static credential exists anywhere in the system -- the shared AI service layer holds no secret that can be stolen, only an identity that can be audited and revoked.

Sandbox Isolation: When AI Calls Touch Local Data

Shared AI service layers create a new class of security boundary problem. When multiple components share an AI service layer and that layer can access local filesystem or database content to fulfill requests, one component's AI call can potentially reach another component's data.

The plugin cross-contamination attack is a real production threat. In the 2023-2024 ChatGPT plugin era, malicious web content summarized by one plugin could instruct the model to use another plugin's capabilities -- the Zapier integration to send emails, the calendar plugin to modify events. No plugin context isolation existed, and no boundary between plugin execution contexts prevented cross-plugin request forgery. This is a distinct problem from traditional OS-level sandboxing: network isolation alone does not prevent prompt injection through shared AI context.

A common production isolation approach for AI-integrated plugin systems combines three layers. At the execution level, MicroVMs (Firecracker or Kata Containers) give each workload a dedicated kernel separated from the host; Firecracker's published figures are boot times under 125ms and memory overhead under 5MB per microVM. gVisor (a user-space kernel) provides an alternative that intercepts syscalls before they reach the host kernel, at a cost of roughly 10-30% I/O overhead. At the network level, zero-trust controls block all outbound connections by default, restrict DNS resolution, and use micro-segmentation to isolate individual plugin instances. At the prompt level, the shared AI service layer must maintain context isolation between plugin sessions -- preventing one plugin's accumulated context from being accessible to another plugin's requests, even when both route through the same AI layer.

The practical takeaway for platform engineers: the shared AI service layer must treat each component's context as a separate session boundary, not just separate credentials. A gateway that shares credentials correctly but allows context to leak between component sessions has solved only half the isolation problem.
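
At the prompt level, the session boundary can be made explicit by keying stored context on the component as well as the session. The sketch below is a minimal in-memory illustration; IsolatedContextStore and its method names are hypothetical.

```python
from collections import defaultdict


class IsolatedContextStore:
    """Conversation history keyed by (component_id, session_id)."""

    def __init__(self) -> None:
        self._history: dict[tuple[str, str], list[str]] = defaultdict(list)

    def append(self, component_id: str, session_id: str, message: str) -> None:
        self._history[(component_id, session_id)].append(message)

    def context_for(self, component_id: str, session_id: str) -> list[str]:
        """Return only the caller's own history, as a copy it cannot mutate in place."""
        return list(self._history.get((component_id, session_id), []))


store = IsolatedContextStore()
store.append("summarizer", "s1", "untrusted web page text")
# The email plugin shares the session id but sees none of the summarizer's context.
print(store.context_for("email", "s1"))
```

Because the component identifier is part of the key, injected text accumulated by one plugin is never assembled into another plugin's prompt, even when both use the same gateway and the same user session.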

The Reference Implementation Principle

When a platform ships a shared AI service layer, the question of how to enforce its use across all components is as important as the design of the layer itself. The enforcement mechanism determines whether the platform achieves consistency or merely suggests it.

The Vercel AI SDK formalizes this through a LanguageModelV3 interface that all custom providers must implement, and names the Mistral provider explicitly as the reference implementation -- the canonical example that all third-party provider authors follow. When the specification is ambiguous, the correct behavior is determined by observing what the reference implementation does, not by reading documentation. This "reference implementation as ground truth" pattern borrows a practice from language standards (Java TCK, W3C reference browsers) and brings it to AI SDK ecosystems.
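
The idea of resolving ambiguity by behavior can be shown with a toy interface. The sketch below is unrelated to the Vercel AI SDK's actual types: ChatProvider, ReferenceProvider, and matches_reference are hypothetical, and the providers are deterministic stand-ins. A real conformance suite would compare response shapes and edge-case handling rather than model output.

```python
from typing import Iterator, Protocol, runtime_checkable


@runtime_checkable
class ChatProvider(Protocol):
    """Hypothetical provider interface a platform might publish."""

    def stream(self, prompt: str) -> Iterator[str]: ...
    def count_tokens(self, text: str) -> int: ...


class ReferenceProvider:
    """Reference behavior: whitespace-delimited tokens, streamed one at a time."""

    def stream(self, prompt: str) -> Iterator[str]:
        yield from prompt.split()

    def count_tokens(self, text: str) -> int:
        return len(text.split())


def matches_reference(candidate: object, probes: list[str]) -> bool:
    """Check that a candidate behaves like the reference on the same probe inputs."""
    if not isinstance(candidate, ChatProvider):
        return False  # missing one of the interface methods
    reference = ReferenceProvider()
    return all(
        list(candidate.stream(p)) == list(reference.stream(p))
        and candidate.count_tokens(p) == reference.count_tokens(p)
        for p in probes
    )
```

An empty prompt is a useful probe: the interface alone does not say whether it should stream zero chunks or one empty chunk, and the reference settles the question.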

The enforcement spectrum runs from architectural incentive to hard network enforcement:

Mechanism	Example	Strictness
Manifest declarations + consent dialog	VSCode Language Model API	Incentive-based
Core-bundled library + filter hooks	WordPress 7.0 AI Client	Architectural incentive
Virtual keys + budget caps + hook pipeline	LiteLLM Proxy	Hard proxy enforcement
Platform-managed credentials, no extension access	JetBrains IDE Services	Hard configuration enforcement
OAuth/IAM + VPC-private endpoints	AWS Bedrock AgentCore	Hard identity/network enforcement

Incentive-based enforcement is appropriate when components are developed by external parties who cannot be forced to use the platform API. Hard enforcement is appropriate when components are first-party or the platform controls the deployment environment.

The most effective platforms use both: hard enforcement for credentials (no component ever holds raw keys) combined with architectural incentives for API usage (the platform API does more than a direct provider call, making it the path of least resistance).

Six Architectural Principles for Platform Engineers

These principles converge across every successful shared AI service layer implementation.

1. The gateway as single chokepoint. All AI traffic through one component enables centralized auth, logging, rate limiting, and cost tracking that is impossible when each component calls providers directly. This is the foundational principle from which the others derive.

2. Virtual keys as the multi-tenancy primitive. Scoped keys carry model allowlists, budget caps, and rate limits. The gateway translates them to real credentials at runtime. Component authors never hold raw API keys. This single mechanism eliminates credential sprawl, the attribution black hole, and billing fragmentation simultaneously.

3. The middleware pipeline for extensible governance. Expose a lifecycle hook model patterned on web framework middleware -- hooks before the call, parallel to the call, after success, after each streaming chunk, and after failure. This lets platform teams add compliance features, content scanning, and observability without modifying the gateway core, and lets component authors add component-specific behavior within platform-enforced constraints.

4. Reference implementation as standard. Ship a working reference implementation alongside the interface definition. Third-party component authors replicate the reference, not the documentation. Ambiguity in the specification is resolved through behavior, not interpretation. This approach produces more consistent implementations than documentation alone can achieve.

5. Credential injection eliminates a problem class. The gateway holds credentials and injects them at request time. No component ever holds credentials. This eliminates not just the direct credential theft threat but also the entire category of plugin-level credential mishandling vulnerabilities like CVE-2025-11749.

6. Typed runtime context for deep hierarchies. When shared services must reach deeply nested components, thread them through a typed context container (following LangGraph's Runtime[ContextT], OpenAI's Agent[ContextT], or Pydantic AI's RunContext[DepsType]) rather than through global state or implicit configuration. This keeps shared resources explicit in the code structure while avoiding the ceremony of passing them through every intermediate layer.

Trade-offs: Flexibility vs. Standardization

The shared AI service layer pattern involves genuine trade-offs that platform engineers should acknowledge rather than paper over.

Component autonomy vs. platform consistency. Components that use the platform's AI layer are constrained to models and configurations the platform exposes. A component that needs a specialized embedding model or a very specific prompt format may find the platform layer an impediment. The JetBrains approach (no extension access to credentials) maximizes consistency at the cost of component flexibility. The VSCode approach (incentive-based) preserves flexibility while achieving consistency where components choose to participate.

Latency overhead. Every proxied call adds network hops. For latency-sensitive applications, this overhead may be unacceptable. The standard mitigation is co-location (gateway and LLM backend on the same network segment, ideally via private endpoints), but this requires infrastructure investment.

Platform vendor lock-in. Components built against the platform's AI abstraction layer work within that platform and nowhere else. If the platform changes its AI layer (model names, API shapes, capability flags), components must update. WordPress's using_model_preference('claude-opus-4') ties component behavior to both WordPress's AI layer and the underlying model family. This is a deliberate trade-off: standardization enables the centralized governance that makes the pattern valuable, but that standardization is necessarily platform-specific.

The partial adoption problem. In ecosystems where component adoption of the shared layer is voluntary, the benefits of the pattern are proportional to adoption rate. Forty percent adoption means roughly sixty percent of AI calls remain ungoverned, unaudited, and unattributed. Platforms must invest in making the standard path clearly superior to the bypass -- better capabilities, simpler integration, visible compliance benefits -- to achieve the adoption rate where centralized governance delivers its full value.

Practical Implications

For platform engineers evaluating whether to build a shared AI service layer:

The decision threshold is roughly two components. If you have one AI-integrated component, per-component integration is fine. At two, you have credential duplication and divergent update surfaces. At four, you have a compliance problem. Build the shared layer before you have four components, not after.

Start with credential abstraction before API abstraction. The security and compliance benefits of centralized credential management arrive immediately and independently of whether components adopt a unified API. A virtual key system with budget caps can be deployed alongside existing per-component integrations as a migration path rather than a forced cutover.

The middleware pipeline is where value compounds. The gateway's base function -- credential injection and request routing -- is table stakes. The compounding value comes from the middleware hooks: budget tracking, content moderation, audit logging, cost attribution. Each hook adds a capability that would require independent implementation in every component without the shared layer.

Invest in the reference implementation. The platform's AI integration quality is determined by the worst-integrated component in the ecosystem. A reference implementation that component authors can read, run, and copy establishes the floor. Documentation establishes the ceiling of what authors understand; working code establishes what they actually do.