Prompt Caching for AI Agents: Architecture Patterns for Cost and Latency Optimization
Executive Summary
Prompt caching — the reuse of previously computed key-value (KV) attention tensors for repeated prompt prefixes — is one of the highest-leverage optimizations available to production AI agent systems. For agentic workloads where the same system prompt, tool definitions, and conversation history accumulate across dozens of API calls, naively ignoring caching mechanics leaves 40–90% of input token costs on the table.
This article covers how prompt caching works at the infrastructure level, how each major provider exposes it, and — critically — the architectural patterns that determine whether an agent system achieves near-100% cache hit rates or inadvertently breaks every cache it builds.
How Prompt Caching Works
The KV-Cache Mechanism
Every transformer LLM computes "key" and "value" vectors for each token when processing a prompt. These computations are expensive: for a 100K-token context, they dominate both latency and compute cost. Prompt caching stores these KV tensors server-side after the first computation. When a subsequent request shares an identical token prefix, the server reuses the cached tensors rather than recomputing them.
The condition for a cache hit is strict: byte-for-byte identical prefix. A single changed character — even a whitespace difference — causes a full cache miss. This is why architectural discipline around prompt construction is essential, not optional.
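How brittle this is can be seen with a tiny sketch: a hypothetical helper that computes the shared prefix between two serialized prompts. A cache hit only ever covers this shared region, so one changed character near the start collapses it.

```typescript
// Length of the shared prefix between two serialized prompts.
// A cache hit only covers this shared region; one changed character
// near the start invalidates everything after it.
function sharedPrefixLength(a: string, b: string): number {
  let i = 0;
  while (i < a.length && i < b.length && a[i] === b[i]) i++;
  return i;
}

const base = "You are a support agent. Today's tools: search, email. User: hi";
const drifted = base.replace("Today's", "Todays"); // one-character drift

console.log(sharedPrefixLength(base, base));    // full length: complete hit
console.log(sharedPrefixLength(base, drifted)); // diverges at the changed character
```

Everything after the divergence point, however stable, must be recomputed.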
What Gets Cached vs. What Cannot
Cacheable content (stable, reusable):
- System prompt instructions
- Tool/function definitions
- Background documents and retrieval context
- Few-shot examples
- Structured output schemas
Non-cacheable (must remain dynamic):
- The current user message
- Live tool results injected mid-conversation
- Timestamps or request IDs embedded anywhere before dynamic content
- Any content that changes between requests
The fundamental rule: everything that can be cached must appear before everything that cannot.
Provider Comparison (2025–2026)
Anthropic Claude — Explicit Cache Control
Anthropic requires developers to explicitly mark cache breakpoints using cache_control. The API accepts up to 4 cache breakpoints per request. The supported type is ephemeral, with configurable TTL:
{
"cache_control": { "type": "ephemeral", "ttl": "1h" }
}
Default TTL is 5 minutes. The extended 1-hour TTL option was added in late 2025 for workloads where conversations naturally exceed 5 minutes. A cache breakpoint placed on the system prompt block means the full system prompt (including tool definitions appended to it) is cached as a prefix.
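A minimal sketch of the request shape (field names per Anthropic's API; the model name and prompt text are illustrative placeholders):

```typescript
// Sketch of an Anthropic Messages API request body. The cache_control
// breakpoint on the system block caches everything up to and including
// that block; the user message below it stays dynamic.
const body = {
  model: "claude-sonnet-4-5", // illustrative model id
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: "You are a support agent. <long stable instructions + tools>",
      cache_control: { type: "ephemeral", ttl: "1h" }, // extended TTL
    },
  ],
  messages: [{ role: "user", content: "Where is my order?" }],
};

console.log(JSON.stringify(body.system[0].cache_control));
```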
Key characteristics:
- Cost: Cached input tokens are billed at 10% of standard input token price (90% reduction)
- Latency: Up to 85% reduction for long prompts; a 100K-token prompt that previously took 11.5s drops to ~2.4s
- Cache hit rate: 100% when the cache is warm and the prefix is identical
- Workspace isolation: As of February 5, 2026, caches are isolated per workspace (not per organization). This matters for multi-tenant deployments — different workspaces cannot share caches even if they use identical prompts.
In 2026, Anthropic introduced automatic prompt caching, which places a cache breakpoint on the last cacheable block without developer involvement. This simplifies integration for teams that don't want to manage breakpoints manually, though explicit control remains available for optimization.
OpenAI — Automatic Prefix Caching
OpenAI's approach is fully automatic — no API changes required. The system caches the longest matching prefix of any prompt that is 1,024 tokens or longer.
Cache matching works in 128-token increments: the cached sequence length will always be in the set {1024, 1152, 1280, 1408, ...}. This means even slight prefix mismatches can drop the effective cached length dramatically.
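The effective cached length under these rules can be sketched as a small helper (a model of the published behavior, not an OpenAI API call):

```typescript
// How many prefix tokens OpenAI's automatic caching can reuse:
// nothing below 1,024 tokens, then 128-token steps rounded down.
function cachedTokens(matchedPrefixTokens: number): number {
  if (matchedPrefixTokens < 1024) return 0;
  return 1024 + Math.floor((matchedPrefixTokens - 1024) / 128) * 128;
}

console.log(cachedTokens(1023)); // 0 — below the minimum, never cached
console.log(cachedTokens(1500)); // 1408 — rounded down to a 128-token boundary
```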
Key characteristics:
- Cost: Cached tokens are billed at 50% of standard input price (50% reduction) — less aggressive than Anthropic's discount
- Minimum: 1,024 tokens; requests shorter than this are never cached
- Increment: 128-token blocks — a single token difference shifts the cache boundary
- No TTL control: Cache lifetime is managed automatically by OpenAI
- Supported content: Messages array, tool definitions, and structured output schemas all participate in the prefix
Because caching is automatic, the discipline burden falls entirely on prompt structure: if tool definitions or the system prompt change between requests, cache effectiveness collapses.
Google Gemini — Context Caching
Google offers both implicit prefix caching (similar to OpenAI's automatic approach) and explicit "context caching" via the API. Explicit context caching allows developers to upload a large document once and reference its cache ID across multiple requests, which is well-suited for agentic systems that repeatedly query the same knowledge base.
Google's 2M token context window combined with explicit caching makes it particularly strong for document-heavy agent workflows.
Agent Architecture Patterns for Maximum Cache Efficiency
Pattern 1: Static-First, Dynamic-Last Ordering
This is the foundational rule. Structure every prompt in this order:
1. System instructions (static)
2. Tool definitions (static within a session)
3. Few-shot examples (static)
4. Conversation history (grows, but old parts are static)
5. Latest user message (dynamic — always at the end)
Violating this order — for example, injecting a timestamp into the system prompt, or appending retrieval results before tool definitions — breaks the cache for everything that follows.
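A hypothetical prompt-assembly function makes the ordering concrete (the types and names are illustrative, not any provider's SDK):

```typescript
// Hypothetical assembly of a prompt in static-first order. Only the
// ordering matters: everything stable precedes everything dynamic.
interface Msg { role: "system" | "user" | "assistant"; content: string }

function buildPrompt(
  systemPrompt: string, // 1–3. static instructions, tool defs, examples
  history: Msg[],       // 4. old turns: append-only, so prefix-stable
  userMessage: string,  // 5. dynamic — always last
): Msg[] {
  return [
    { role: "system", content: systemPrompt },
    ...history,
    { role: "user", content: userMessage },
  ];
}

const prompt = buildPrompt("You are an agent.", [], "Hello");
console.log(prompt[prompt.length - 1].content); // the dynamic part is last
```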
Pattern 2: Freeze Tool Definitions Early
Tool definitions for an agent session are typically fixed: the set of tools available doesn't change mid-conversation. However, teams frequently rebuild tool definition objects programmatically, and subtle differences (JSON key ordering, formatting whitespace, description updates) cause cache misses across all subsequent content.
Best practice: serialize tool definitions to a canonical JSON string once at agent startup and reuse that exact string for the session lifetime. A checksum-based approach can detect accidental mutations:
import { createHash } from "node:crypto";
const TOOLS_CANONICAL = JSON.stringify(buildToolDefinitions()); // serialize exactly once
const TOOLS_HASH = createHash("sha256").update(TOOLS_CANONICAL).digest("hex");
// Fail fast at startup if serialization drifts from the pinned value
if (TOOLS_HASH !== EXPECTED_TOOLS_HASH) throw new Error("tool definitions drifted");
The payoff is significant: complex tool definitions can be 2,000–10,000 tokens. In an agent with 50 tool-calling steps, caching the tool block alone can save 100K–500K tokens per session.
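Note that `JSON.stringify` preserves object-key insertion order, so two logically equal tool definitions built in different code paths can still serialize differently. A sketch of a canonicalizer that sorts keys recursively (a common technique, not part of any provider SDK):

```typescript
// Recursively sort object keys so logically equal values always
// serialize to the same byte sequence.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalize).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const entries = Object.keys(value as object)
      .sort()
      .map((k) => JSON.stringify(k) + ":" +
                  canonicalize((value as Record<string, unknown>)[k]));
    return "{" + entries.join(",") + "}";
  }
  return JSON.stringify(value);
}

// Same tool definition, different key order — identical canonical form:
console.log(canonicalize({ name: "search", description: "d" }) ===
            canonicalize({ description: "d", name: "search" })); // true
```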
Pattern 3: Separate Tool Results from the Cache Prefix
The research paper "Don't Break the Cache" (arXiv:2601.06007, 2025) established a critical finding: naive full-context caching can paradoxically increase latency relative to no caching. The cause is that when tool results are voluminous and dynamic, including them inside the cache breakpoint forces a cache write on every step, and the write overhead exceeds the read savings.
The optimal strategy is to cache the system prompt and tool definitions as a stable prefix, but treat tool results as non-cached dynamic content appended at the end. This yields:
- 41–80% cost reduction (vs. 0% for no caching or negative for naive caching)
- 13–31% time-to-first-token improvement
The specific strategies tested in the paper, ranked by effectiveness:
- System prompt only caching — most consistent benefits
- Full context minus tool results — good for long conversations
- Naive full-context caching — risky, provider-dependent
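A sketch of the "system prompt only" strategy in an Anthropic-shaped request (simplified: real tool results use `tool_result` content blocks; the text values are placeholders):

```typescript
// The breakpoint covers only the stable prefix; tool results live in
// messages, after the cached region, so a new result never forces a
// rewrite of the cached prefix.
const request = {
  system: [
    {
      type: "text",
      text: "<instructions + tool definitions>",
      cache_control: { type: "ephemeral" }, // breakpoint: end of stable prefix
    },
  ],
  messages: [
    { role: "user", content: "Find recent orders" },
    { role: "assistant", content: "<tool call>" },
    { role: "user", content: "<tool result: 12 KB of JSON>" }, // dynamic, uncached
  ],
};

console.log(request.messages.length); // tool traffic stays outside the prefix
```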
Pattern 4: Multi-Layered Cache Prefixes for Multi-Tenant Systems
For platforms serving multiple users (like Zylos's multi-tenant architecture), a layered approach maximizes shared cache value while maintaining isolation:
Layer 1: Global prefix — shared across ALL users
- Core system identity and capabilities
- Universal tool definitions
- Company-wide policies
↓ Cache breakpoint #1
Layer 2: User-class prefix — shared within a user segment
- Role-specific instructions
- Feature flags for that user tier
↓ Cache breakpoint #2 (Anthropic supports up to 4)
Layer 3: Per-user context — user profile, preferences
↓ Cache breakpoint #3
Layer 4: Conversation history (dynamic, grows per session)
Layer 5: Current message (never cached)
With this structure, the expensive global layer (potentially 50K+ tokens for a full instruction set) is cached once and reused across all users' requests. Anthropic's workspace-level isolation ensures that different tenant workspaces cannot share each other's cache, preventing cross-tenant data exposure.
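The layered structure can be sketched as a hypothetical builder that emits one system block per layer, each carrying its own breakpoint (Anthropic-shaped blocks; the function and its arguments are illustrative):

```typescript
// Hypothetical layered system blocks for a multi-tenant agent. Each
// layer boundary gets a breakpoint (Anthropic allows up to 4), so a
// request reuses the deepest layer whose prefix still matches.
function layeredSystem(globalText: string, tierText: string, userText: string) {
  const bp = { cache_control: { type: "ephemeral" as const } };
  return [
    { type: "text", text: globalText, ...bp }, // layer 1: shared by all tenants
    { type: "text", text: tierText, ...bp },   // layer 2: shared by the user tier
    { type: "text", text: userText, ...bp },   // layer 3: this user only
  ];
}

const system = layeredSystem("<global>", "<pro tier>", "<user profile>");
console.log(system.length); // 3 layers, 3 breakpoints
```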
Pattern 5: Agentic Plan Caching
A 2025 paper on "Agentic Plan Caching" introduced a higher-level approach: caching the agent's reasoning plan (the decomposed task structure), not just the prompt prefix. When an agent encounters a task semantically similar to a previously solved one, it retrieves the cached plan and adapts it rather than replanning from scratch.
Benchmarks show this approach delivers:
- 50.31% cost reduction on average
- 27.28% latency reduction on average
- Quality maintained at near-baseline levels
This is complementary to KV-cache optimization — one operates at the API call level, the other at the agent reasoning level.
Security Considerations: The KV-Cache Side Channel
A 2025 NDSS paper ("I Know What You Asked: Prompt Leakage via KV-Cache Sharing") identified a serious security vulnerability in multi-tenant LLM serving: by observing cache behavior timing, an adversarial user can reconstruct other users' prompt prefixes. The attack exploits the fact that a cache hit produces faster responses than a cache miss.
Implications for agent platforms:
- Never share caches across tenant boundaries — this is now enforced at the provider level (Anthropic workspace isolation), but self-hosted deployments (vLLM, llama.cpp) require explicit configuration
- vLLM mitigation: Use the cache_salt parameter to prevent cross-tenant prefix sharing even when prefixes are identical
- Avoid predictable cache refresh patterns: Batch cache invalidation into irregular maintenance windows to reduce timing attack surface
- Audit what's in system prompts: If a system prompt contains sensitive business logic, it will be reconstructible by sophisticated adversaries in shared infrastructure
For self-hosted agentic deployments, the recommendation is namespace-based cache isolation: each tenant's KV-cache namespace is cryptographically separated, trading some cache efficiency for strict isolation.
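A sketch of deriving a stable per-tenant salt (the helper and secret name are hypothetical; vLLM's cache_salt request field is one place such a value could be passed):

```typescript
import { createHash } from "node:crypto";

// Derive a deterministic per-tenant salt so identical prefixes from
// different tenants never map to the same KV-cache entry.
function tenantCacheSalt(tenantId: string, deploymentSecret: string): string {
  return createHash("sha256")
    .update(deploymentSecret + ":" + tenantId)
    .digest("hex")
    .slice(0, 16);
}

const a = tenantCacheSalt("tenant-a", "s3cret");
const b = tenantCacheSalt("tenant-b", "s3cret");
console.log(a !== b); // distinct salts → isolated cache namespaces
```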
Semantic Caching: Beyond Prefix Matching
Exact prefix matching has a fundamental limitation: even semantically identical prompts ("What is the weather?" vs. "Tell me the weather") produce different tokens and thus different cache keys. Semantic caching addresses this by storing responses and retrieval context indexed by embedding vectors.
SemanticALLI (2025) demonstrated this at PMG's marketing analytics platform by decomposing agent pipelines into cacheable intermediate representations (IRs) rather than caching raw responses. The system:
- Achieves 78% token usage reduction vs. monolithic caching's 38.7% cap
- Caches "analytic intents" (what the user is actually asking for) rather than their literal phrasing
- Uses hybrid retrieval: exact match → semantic neighborhood → lexical fallback
For production agent systems, semantic caching is most valuable when:
- The agent answers many similar but not identical queries
- Tool call results are expensive (API costs, latency) and idempotent
- Users frequently rephrase the same underlying intent
The implementation complexity is higher than KV-cache optimization, but the ceiling for cache hit rate improvement is also much higher.
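A minimal in-memory sketch of the core mechanism: responses keyed by embedding vectors, with a lookup that hits when cosine similarity clears a threshold. The vectors below stand in for real embedding-model output; production systems would use an ANN index rather than a linear scan.

```typescript
type Entry = { vec: number[]; response: string };

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

class SemanticCache {
  private entries: Entry[] = [];
  constructor(private threshold: number) {}
  get(vec: number[]): string | undefined {
    let bestScore = -1;
    let best: Entry | undefined;
    for (const e of this.entries) {
      const s = cosine(vec, e.vec);
      if (s > bestScore) { bestScore = s; best = e; }
    }
    return bestScore >= this.threshold ? best?.response : undefined;
  }
  put(vec: number[], response: string): void {
    this.entries.push({ vec, response });
  }
}

const cache = new SemanticCache(0.95);
cache.put([1, 0, 0.1], "It is sunny.");
console.log(cache.get([1, 0.02, 0.1])); // near-duplicate phrasing → hit
console.log(cache.get([0, 1, 0]));      // unrelated query → undefined
```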
Measuring Cache Effectiveness
Deploying caching without measuring it is a missed optimization. Track these metrics:
| Metric | Description | Target |
|---|---|---|
| Cache hit rate | % of requests with at least partial cache hit | >80% for production agents |
| Cached token fraction | Cached tokens / total input tokens | >60% for long-context agents |
| TTFT delta | Time-to-first-token: cached vs. uncached | Should see >30% improvement |
| Input token savings | Monthly cached tokens × price delta | Track in cost dashboard |
| Cache write overhead | Cost of write-only misses | Should be <5% of total input cost |
Both Anthropic and OpenAI return cache statistics in API responses (Anthropic: usage.cache_read_input_tokens and usage.cache_creation_input_tokens; OpenAI: usage.prompt_tokens_details.cached_tokens). Log these alongside each API call and aggregate in your observability stack.
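Folding the Anthropic-style usage fields into the cached-token-fraction metric from the table is straightforward (the numbers below are fake):

```typescript
// Aggregate per-call usage into the cached token fraction metric.
interface Usage {
  input_tokens: number;                // fresh, uncached tokens
  cache_read_input_tokens: number;     // tokens served from cache
  cache_creation_input_tokens: number; // tokens written to cache this call
}

function cachedTokenFraction(calls: Usage[]): number {
  let cached = 0, total = 0;
  for (const u of calls) {
    cached += u.cache_read_input_tokens;
    total += u.input_tokens + u.cache_read_input_tokens + u.cache_creation_input_tokens;
  }
  return cached / total;
}

const calls: Usage[] = [
  { input_tokens: 500, cache_read_input_tokens: 0, cache_creation_input_tokens: 8000 },
  { input_tokens: 500, cache_read_input_tokens: 8000, cache_creation_input_tokens: 0 },
];
console.log(cachedTokenFraction(calls).toFixed(2)); // "0.47"
```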
Practical Checklist for Agent Systems
Prompt structure:
- System prompt is the first block, never modified between requests
- Tool definitions are serialized canonically (no whitespace drift, stable key order)
- Tool results are appended after cache breakpoints, not before
- No timestamps, request IDs, or random values in cacheable content
- User-specific content comes after all shared content
Provider configuration:
- Anthropic: cache_control breakpoints placed on system prompt and tool definition blocks
- Extended TTL (1h) configured for sessions that exceed 5 minutes
- OpenAI: verify minimum prompt length ≥1,024 tokens; pad system prompt if needed for short-context use cases
- Cache hit metrics are being logged per request
Multi-tenant:
- Different tenants use separate workspaces (Anthropic) or cache namespaces (self-hosted)
- Global shared prefix is factored out and cached at the highest level
- No per-user data in the shared cache layer
Security:
- System prompts don't contain secrets (passwords, API keys) — they're reconstructible
- Self-hosted deployments use cache_salt or namespace isolation
- Cache invalidation timing is irregular, not predictable
Real-World Impact Estimates for Zylos
For a Zylos-style agent with the following profile:
- System prompt: ~3,000 tokens (identity + tools + instructions)
- Tool definitions: ~5,000 tokens (15 tools with descriptions and schemas)
- Average conversation: 20 turns, ~2,000 tokens/turn
Without caching: 20 turns × (3,000 + 5,000 + avg 10,000 history) = ~360,000 input tokens/session
With caching optimally applied:
- System + tools (8,000 tokens) cached after turn 1: saves 8,000 × 19 = 152,000 tokens at 90% discount
- Growing history partially cached: ~40,000 tokens saved
- Effective cost: equivalent to ~168,000 uncached tokens instead of 360,000 — 53% cost reduction
At Claude Sonnet pricing (~$3/M input tokens), this is $1.08 vs. $0.50 per session — roughly halving inference costs for ongoing conversations. At scale across hundreds of daily sessions, this compounds to meaningful monthly savings.
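The arithmetic above generalizes to a small cost model (a sketch: cached reads bill at a fraction of the input price, 0.1 for Anthropic and 0.5 for OpenAI; pricing is illustrative, not a quote):

```typescript
// Input cost for one request: fresh tokens at full price, cached reads
// at a provider-specific fraction of it.
function inputCostUSD(
  freshTokens: number,
  cachedReadTokens: number,
  pricePerMTok: number,
  cachedFactor = 0.1, // Anthropic's 90% discount
): number {
  return ((freshTokens + cachedReadTokens * cachedFactor) * pricePerMTok) / 1e6;
}

// The session estimate above, expressed in effective tokens:
console.log(inputCostUSD(360_000, 0, 3).toFixed(2)); // "1.08" — no caching
console.log(inputCostUSD(168_000, 0, 3).toFixed(2)); // "0.50" — optimal caching
```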
Summary
Prompt caching is not a set-and-forget feature — it requires deliberate architectural choices. The most impactful practices are:
- Structure prompts static-first: system prompt → tool definitions → history → user message
- Freeze tool definitions canonically and assert stability at startup
- Exclude dynamic tool results from cache breakpoints; let them float at the end of context
- Layer cache prefixes in multi-tenant systems: global → user-class → per-user → conversation
- Measure cache hit rate and cached token fraction in your observability stack
- Enforce tenant cache isolation to prevent KV-cache side-channel attacks
With these patterns applied, production agent systems routinely achieve 41–90% input token cost reductions and 13–85% latency improvements — with no change to model quality.

