Zylos
2026-03-27

Prompt Caching and KV Cache Optimization for Long-Running AI Agent Sessions

research · ai-agents · optimization · infrastructure

Executive Summary

Prompt caching has become one of the highest-leverage optimizations available to production AI agent systems in 2026. Across Anthropic, OpenAI, and Google, cached input tokens cost 50–90% less than uncached tokens and deliver 65–85% lower time-to-first-token latency. For long-running agents with large, stable system prompts, the break-even point is remarkably low — just 1.4–2 cache hits per cached prefix. Yet naive caching strategies can paradoxically increase latency and cost by creating cache entries that are never reused. This article covers the full stack: KV cache internals, provider APIs, system prompt design, context window management, and production patterns that consistently achieve 80%+ cache hit rates.


KV Cache Internals: What Actually Gets Cached

Before examining provider APIs, it helps to understand what is being cached at the inference level.

How Transformer Inference Uses KV Caches

During autoregressive generation, a transformer must compute self-attention over all prior tokens for each new token generated. Without caching, this means re-processing the entire prompt on every decoding step — a quadratic cost. The KV cache solves this by storing the computed key (K) and value (V) tensors for every attention layer and every prior token. On subsequent steps, the model retrieves stored K/V pairs and only computes attention for the newest query token.

This is the intra-request KV cache — it is standard in all modern inference engines and essentially free to the developer. The optimization described in this article is the cross-request KV cache: reusing K/V tensors from a previous request when a new request begins with the same prefix.
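A minimal single-head attention sketch (NumPy, purely illustrative — real engines cache per layer and per head) shows why caching works: the incrementally cached computation is exactly the same as recomputing attention over the full prefix at every step.

```python
import numpy as np

def attention_step(q, K, V):
    """Attend one query vector over all cached keys/values."""
    scores = K @ q / np.sqrt(q.shape[0])       # (n,)
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    return weights @ V                         # (d,)

rng = np.random.default_rng(0)
d, n = 8, 6
X = rng.normal(size=(n, d))                    # token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Incremental decoding: grow the KV cache one token at a time.
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
cached_out = []
for t in range(n):
    K_cache = np.vstack([K_cache, X[t] @ Wk])  # append new key row
    V_cache = np.vstack([V_cache, X[t] @ Wv])  # append new value row
    cached_out.append(attention_step(X[t] @ Wq, K_cache, V_cache))

# Recompute from scratch (no cache) and confirm identical results.
full_out = [attention_step(X[t] @ Wq, X[:t+1] @ Wk, X[:t+1] @ Wv)
            for t in range(n)]
assert np.allclose(cached_out, full_out)
```

The assertion is the whole point: caching changes cost, never output, which is what makes cross-request reuse safe when prefixes match token-for-token.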

Prefix Caching

When two requests share an identical token prefix, the K/V tensors for that prefix are mathematically identical. Inference engines like vLLM implement Automatic Prefix Caching (APC) using a radix tree structure (RadixAttention in SGLang) to identify and reuse matching prefixes across requests. Each KV block is addressed by a hash of its content and its position in the prefix tree, enabling O(1) lookups.
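The mechanics can be sketched in a few lines: each block is hashed together with everything before it, so a block hash identifies both content and prefix position. This toy version uses 128-token blocks (an illustrative granularity, not vLLM's actual block size) and reuses whole blocks across requests that share a prefix.

```python
import hashlib

BLOCK = 128  # tokens per KV block (illustrative granularity)

def block_hashes(tokens):
    """Hash each full block cumulatively with its prefix, so a block's
    identity depends on everything before it (prefix-position aware)."""
    hashes, h = [], hashlib.sha256()
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h.update(repr(tokens[i:i + BLOCK]).encode())
        hashes.append(h.copy().hexdigest())
    return hashes

class PrefixCache:
    def __init__(self):
        self.blocks = {}  # hash -> (simulated) KV tensors

    def lookup_and_store(self, tokens):
        """Return the number of cached tokens reused; store the rest."""
        reused = 0
        for hh in block_hashes(tokens):
            if hh in self.blocks:
                reused += BLOCK
            else:
                self.blocks[hh] = "kv-block"   # stand-in for K/V tensors
        return reused

cache = PrefixCache()
prompt_a = list(range(300))             # 300-token prompt
prompt_b = list(range(300)) + [9, 9]    # same prefix, new suffix
cache.lookup_and_store(prompt_a)        # cold: nothing reused
print(cache.lookup_and_store(prompt_b))  # 256 (two full shared blocks)
```

Only complete blocks are reusable — the 44 trailing tokens of the shared prefix fall in a partial block and are recomputed, which is why providers quote caching "in increments".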

The memory trade-off is explicit: GPU HBM used for the KV cache comes at the expense of batch size. Production systems must tune cache size against throughput. vLLM's LRU eviction policy (effectively equivalent to RadixAttention's leaf-node eviction) handles this, but the 2025 KVFlow paper demonstrates that LRU is suboptimal for multi-agent workflows where agents have known execution schedules. KVFlow's workflow-aware eviction achieves up to 2.19× speedup over standard LRU in high-concurrency multi-agent scenarios by preserving caches for agents known to run soon.

Automatic vs. Explicit Caching

At the provider level, there are two models:

  • Automatic (OpenAI): No code changes required. The provider automatically caches the longest matching prefix of any prompt over 1,024 tokens, in 128-token increments. Developers get a 50% discount on cache hits with no cache write cost and no TTL management.
  • Explicit (Anthropic): Developers mark specific content blocks with cache_control directives. Offers more control, higher discounts (90% on reads), and configurable TTLs (5 minutes or 1 hour). Requires deliberate API design but enables more targeted cache strategies.
  • Explicit (Google): Named cache objects that must be created, managed, and expired by the developer. Most flexible but highest operational overhead.

Provider Mechanics and Pricing

Anthropic Claude Prompt Caching

Anthropic's implementation gives the most control and the highest savings potential.

API syntax:

# Automatic caching (top-level cache_control)
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-6",
    cache_control={"type": "ephemeral"},
    system="Your long, stable system prompt...",
    messages=[{"role": "user", "content": user_message}]
)

# Explicit breakpoints (granular control)
response = client.messages.create(
    model="claude-sonnet-4-6",
    system=[
        {
            "type": "text",
            "text": "Static agent instructions and tool schemas...",
            "cache_control": {"type": "ephemeral", "ttl": "1h"}
        },
        {
            "type": "text",
            "text": f"Today's date: {today}. User context: {user_ctx}",
            # No cache_control — this changes per request
        }
    ],
    messages=[...]
)

Key parameters:

  • Up to 4 explicit cache breakpoints per request
  • Minimum cacheable prefix: 2,048 tokens (Sonnet 4.6) or 4,096 tokens (Opus 4.6, Haiku 4.5)
  • TTL: 5 minutes ({"type": "ephemeral"}) or 1 hour ({"type": "ephemeral", "ttl": "1h"})
  • Cache write cost: 1.25× base (5-min) or 2× base (1-hour)
  • Cache read cost: 0.1× base (90% discount)
  • Lookback window: system checks up to 20 blocks backward to find a matching cache entry

Monitoring:

{
  "usage": {
    "cache_creation_input_tokens": 5000,
    "cache_read_input_tokens": 10000,
    "input_tokens": 50,
    "output_tokens": 200
  }
}

The 85% latency improvement figure (100K token prompt: 11.5s → 2.4s TTFT) requires a warm cache — the first request always pays full write cost.

OpenAI Prompt Caching

OpenAI's approach optimizes for zero developer effort:

  • Fully automatic — no API changes, no configuration
  • Activates on prompts over 1,024 tokens, caching in 128-token increments
  • 50% discount on cache hits; no write surcharge
  • Cache TTL: 5–10 minutes of inactivity, always cleared within 1 hour
  • Latency reduction: up to 80% TTFT improvement
  • Supported: GPT-4o, GPT-4o mini, o-series, and fine-tuned variants

The trade-off: less control means less optimization potential. You cannot force cache breakpoints, cannot control TTL, and cannot cache specific subsections. Cache hit rate depends entirely on how consistent your prompt prefix is across requests.
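Because the only lever is prefix consistency, it is worth estimating how much of each request is actually cacheable. A rough estimator under the article's parameters (1,024-token activation threshold, 128-token increments); the integer lists here stand in for real tokenizer output:

```python
def cacheable_overlap(prev_tokens, cur_tokens,
                      min_prefix=1024, increment=128):
    """Estimate how many tokens of cur_tokens an automatic
    prefix cache could serve from the previous request."""
    # Longest common token prefix.
    common = 0
    for a, b in zip(prev_tokens, cur_tokens):
        if a != b:
            break
        common += 1
    if common < min_prefix:
        return 0                        # below activation threshold
    return common - common % increment  # round down to a 128-token step

stable = list(range(1500))              # stable system prompt + tools
req1 = stable + [1, 2, 3]
req2 = stable + [4, 5, 6]               # same prefix, new user turn
req3 = [0] + stable                     # token prepended: cache broken
print(cacheable_overlap(req1, req2))    # 1408
print(cacheable_overlap(req1, req3))    # 0
```

The third case illustrates the failure mode: a single token inserted at the front (a timestamp, a session ID) reduces the usable common prefix below the threshold and zeroes out the cache.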

Google Gemini Context Caching

Google's implementation is the most explicit and carries storage costs:

  • Two modes: Implicit (automatic, like OpenAI) and Explicit (named cache objects, like Anthropic's 1-hour cache)
  • Cached tokens: 10% of standard input cost (90% discount)
  • Storage cost: billed by the hour per million tokens stored, prorated to the minute
  • TTL: defaults to 1 hour, configurable (no stated minimum or maximum)
  • Minimum cacheable content: varies by model

Explicit caching is a good fit for workloads where large documents (codebases, legal contracts, knowledge bases) are queried repeatedly over hours. The storage cost is a material consideration that Anthropic's model avoids.
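A back-of-envelope model makes the storage trade-off concrete. The dollar rates below are placeholders, not Google's published prices; the point is the shape of the comparison:

```python
def explicit_cache_savings(tokens_m, queries, hours,
                           base_per_m=0.30,          # assumed $/M input tokens
                           cached_per_m=0.03,        # 90% discount per the article
                           storage_per_m_hour=1.00): # assumed $/M tokens/hour stored
    """Net savings of a named cache vs. resending the content each query."""
    without_cache = queries * tokens_m * base_per_m
    with_cache = (queries * tokens_m * cached_per_m
                  + hours * tokens_m * storage_per_m_hour)
    return without_cache - with_cache

# A 0.5M-token contract corpus queried 40 times over 2 hours:
print(round(explicit_cache_savings(0.5, 40, 2), 2))  # 4.4
```

The same function with low query counts goes negative, which is the practical rule: explicit Gemini caches pay off only when query frequency is high relative to the cache's lifetime.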


System Prompt Design for Cache-Friendly Agents

The single most impactful practice is ordering content by stability: most stable content first, most dynamic content last.

The Canonical Layer Order

For an agent system, the optimal prompt structure is:

[1] Tool definitions (rarely change)
    └── cache_control: ephemeral, ttl: 1h

[2] Static agent instructions (never change)
    └── cache_control: ephemeral, ttl: 1h

[3] Background documents / knowledge base
    └── cache_control: ephemeral (5-min, refreshed on each request)

[4] Session context / user profile
    └── No cache_control (changes per user)

[5] Conversation history
    └── No cache_control (grows every turn)

[6] Current user message
    └── Always at the very end

Every token before the first dynamic element can be cached. Misplacing a timestamp or user ID in layer 1 or 2 destroys cache utility for all subsequent tokens.

The High-Entropy Antipatterns

Research from the "Don't Break the Cache" paper (arXiv 2601.06007) identifies the most common cache-breaking patterns in agentic systems:

  1. Timestamps in system prompts: "Current time: 2026-03-27T14:32:17Z" — changes every second, invalidates the entire cached prefix
  2. Session IDs embedded early: "Session: abc-123-xyz" in the system prompt header
  3. Dynamic tool definitions: Adding or removing tools mid-session forces full cache invalidation
  4. Non-deterministic serialization: JSON objects with non-sorted keys produce different token sequences for identical data
  5. Model switching mid-session: Caches are model-specific — switching to a cheaper model rebuilds from scratch
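A common mitigation for antipattern 1 is to coarsen the timestamp and keep it out of the cached prefix: a day-granularity date changes the token sequence once per day instead of once per request. A sketch (the bucket size is an assumption — pick the coarsest granularity the agent can tolerate):

```python
from datetime import datetime, timezone

def coarse_date(now=None):
    """Day-granularity timestamp: stable within a calendar day."""
    now = now or datetime.now(timezone.utc)
    return now.strftime("%Y-%m-%d")

# Two requests within the same day produce identical context text,
# so any prefix containing it stays cache-stable.
t1 = datetime(2026, 3, 27, 9, 15, tzinfo=timezone.utc)
t2 = datetime(2026, 3, 27, 22, 45, tzinfo=timezone.utc)
assert coarse_date(t1) == coarse_date(t2) == "2026-03-27"

# The dynamic block still goes AFTER all cached blocks:
context_block = {"type": "text", "text": f"Today's date: {coarse_date(t1)}"}
```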

Deterministic Serialization

When including structured data (tool schemas, user profiles, retrieved documents), always sort keys:

import json

# Bad: key order may vary
system_context = json.dumps(user_profile)

# Good: deterministic, cache-stable
system_context = json.dumps(user_profile, sort_keys=True)

A controlled experiment measuring this practice found a 65% median improvement in TTFT and an 85.2% cache hit rate on stable prefixes, versus 0% cache hits on non-deterministic serializations of the same data.
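The property is easy to verify directly: two dicts holding identical data in different insertion order serialize identically only with sort_keys enabled.

```python
import json

a = {"name": "Ada", "plan": "pro"}
b = {"plan": "pro", "name": "Ada"}  # same data, different insertion order

# Without sorting, key order follows insertion order: different token
# sequences for identical data, hence a guaranteed cache miss.
assert json.dumps(a) != json.dumps(b)

# With sorting, the serialized form is canonical and cache-stable.
assert json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True)
```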


Context Window Management for 24/7 Agents

Long-running agents face a compounding problem: conversation history grows indefinitely, but prompt caches depend on stable prefixes. Naively appending all history eventually pushes stable content out of the 20-block lookback window, causing cache misses on previously cached material.

The Stability-Recency Tension

The core tension is:

  • Cache efficiency requires the prefix to remain stable across requests
  • Context quality requires recent conversation history to be appended to the messages array
  • Context limits require old history to eventually be removed or compressed

The solution is to treat context as a layered structure with different cache strategies per layer.

Hierarchical Context Architecture

Layer A: System prompt (immutable)
├── Tool definitions            [1-hour cache, never changes]
├── Agent persona/instructions  [1-hour cache, never changes]
└── Static knowledge            [5-min cache, refreshed each call]

Layer B: Session context (slow-changing)
├── User profile summary        [5-min cache, update on profile change]
└── Active task state           [5-min cache, update on task transitions]

Layer C: Conversation (always growing)
├── Compressed history summary  [5-min cache, checkpoint every N turns]
├── Recent verbatim turns       [no cache, last 3-5 exchanges]
└── Current user message        [no cache]

Layers A and B remain stable and benefit from caching. Layer C's verbatim portion grows with each turn, but regular compression checkpoints limit growth.
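Assembling the three layers into a request then becomes mechanical: stable blocks first, each layer more dynamic than the last. A sketch using Anthropic-style blocks (helper name and field contents are illustrative):

```python
def build_request(layer_a, layer_b, summary_block, recent_turns, user_msg):
    """Assemble layers A/B/C into a cache-friendly request body:
    1-hour cache on immutable blocks, 5-minute on slow-changing ones,
    no cache_control on the growing conversation tail."""
    system = (
        [{**blk, "cache_control": {"type": "ephemeral", "ttl": "1h"}}
         for blk in layer_a] +                  # tools + persona (immutable)
        [{**blk, "cache_control": {"type": "ephemeral"}}
         for blk in layer_b]                    # session context (slow-changing)
    )
    messages = []
    if summary_block:                           # compressed history, if any
        messages.append(summary_block)
    messages.extend(recent_turns)               # verbatim recent turns
    messages.append({"role": "user", "content": user_msg})
    return {"system": system, "messages": messages}

req = build_request(
    layer_a=[{"type": "text", "text": "Tool schemas..."}],
    layer_b=[{"type": "text", "text": "User profile summary..."}],
    summary_block=None,
    recent_turns=[{"role": "assistant", "content": "Done."}],
    user_msg="Next task, please.",
)
assert req["system"][0]["cache_control"]["ttl"] == "1h"
assert req["messages"][-1]["role"] == "user"
```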

Conversation Compression with Cache Preservation

When compressing old history, the key is to produce a deterministic summary that can itself be cached:

def compress_history(turns, summarize):
    """Compress old turns into a cacheable summary block.

    `summarize` must be deterministic: identical turns must yield an
    identical summary string, or the cached block will never be re-hit.
    """
    summary = summarize(turns)

    # The summary becomes a cached block in subsequent requests
    return {
        "type": "text",
        "text": f"[Conversation summary]\n{summary}",
        "cache_control": {"type": "ephemeral"}
    }

Anthropic's compact-2026-01-12 compaction API provides production-ready automatic compaction across Claude API, AWS Bedrock, Google Vertex AI, and Microsoft Foundry with zero data retention support.

The LangChain Deep Agents SDK triggers compression at configurable context utilization thresholds (typically 75%), preserving the system prompt, active task, and last N turns verbatim while compressing the rest. Analyses cited alongside this work attributed 65% of enterprise AI failures in 2025 to context drift rather than raw context exhaustion — validating the importance of active context management.

Sliding Window with Cache Anchor

A practical implementation for multi-turn agents:

class CacheAwareContext:
    def __init__(self, model_context_size, max_verbatim_turns=5, compress_at=0.75):
        self.model_context_size = model_context_size
        self.max_verbatim_turns = max_verbatim_turns
        self.compress_at = compress_at
        self.system_blocks = []      # Always cached
        self.summary_block = None    # Cached summary
        self.recent_turns = []       # Verbatim, uncached

    def add_turn(self, role, content, usage):
        self.recent_turns.append({"role": role, "content": content})

        # Compress when approaching context limit
        if usage.total_tokens / self.model_context_size > self.compress_at:
            self._compress()

    def _compress(self):
        to_compress = self.recent_turns[:-self.max_verbatim_turns]
        summary = self._summarize(to_compress)

        # Create new cached summary block
        self.summary_block = {
            "role": "user",
            "content": [{
                "type": "text",
                "text": f"[Prior conversation summary]\n{summary}",
                "cache_control": {"type": "ephemeral"}
            }]
        }
        self.recent_turns = self.recent_turns[-self.max_verbatim_turns:]

    def to_messages(self):
        messages = []
        if self.summary_block:
            messages.append(self.summary_block)
        messages.extend(self.recent_turns)
        return messages

Cost Optimization and Break-Even Analysis

The Economics

With Anthropic's pricing at approximately $3.00/M input tokens:

Operation             Cost per million tokens
Regular input         $3.00
Cache write (5-min)   $3.75 (1.25×)
Cache write (1-hour)  $6.00 (2×)
Cache read            $0.30 (0.10×)

Break-even for a 5-minute cache: 1.4 reads per cached prefix. For a 1-hour cache: 2.0 reads per cached prefix. Any agent processing more than 2 requests in an hour with the same system prompt should use caching.
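These break-even figures follow from the multipliers if you require read savings to recover the full write cost — a conservative framing, since the write also replaces a base-rate input charge you would have paid anyway. A quick calculation under that framing:

```python
def break_even_reads(write_mult, read_mult=0.10, base_mult=1.00):
    """Reads needed before cumulative read savings cover the write cost,
    treating the full write cost as the amount to recover."""
    savings_per_read = base_mult - read_mult   # 0.90x saved per cached read
    return write_mult / savings_per_read

print(round(break_even_reads(1.25), 2))  # 5-min cache  -> 1.39
print(round(break_even_reads(2.00), 2))  # 1-hour cache -> 2.22
```

This reproduces the ~1.4-read figure for the 5-minute cache and lands around 2.2 reads for the 1-hour cache, in the same ballpark as the 2.0 quoted above.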

When Caching Hurts

Caching is counterproductive when:

  1. Conversation turnover is higher than TTL: If each session is a fresh user with a unique system prompt, there are no repeat reads
  2. Prompts are below minimum token thresholds: Caching won't activate below 1,024–4,096 tokens depending on model
  3. High-entropy prefixes: If any early content is dynamic, write cost is paid with no read benefit
  4. Tool results are cached naively: Tool outputs are session-specific and unlikely to be reused across sessions; caching them wastes write budget
  5. Context pruning is active: If old tool calls are summarized or pruned, any cached representations of that content become invalid

Production Cache Hit Rates

Industry benchmarks from 2025 production deployments:

  • RAG-based customer support (Anthropic): 85% cost reduction, stable system prompt + cached document corpus
  • Code analysis pipeline (OpenAI): 60% savings, large stable code context
  • General SaaS with 50% hit rate: ~49% cost reduction at 100K daily requests
  • Target for healthy implementations: >70–80% cache hit rate (measured as cache_read_input_tokens / total_input_tokens)

The research-backed recommendation is to monitor hit rate continuously. A hit rate below 50% signals either a short-lived prompt structure or a dynamic prefix problem.


Practical Patterns for Agent Systems

Pattern 1: The Frozen System Prompt

The highest-ROI pattern: never modify the system prompt at runtime. Treat it as a compiled artifact.

# Build at startup, never touch during session
SYSTEM_PROMPT_BLOCKS = [
    {
        "type": "text",
        "text": load_tool_definitions(),        # Sorted, deterministic
        "cache_control": {"type": "ephemeral", "ttl": "1h"}
    },
    {
        "type": "text",
        "text": load_agent_instructions(),      # Static markdown
        "cache_control": {"type": "ephemeral", "ttl": "1h"}
    }
]

If user-specific or time-specific context is needed, place it in the first user message, not in the system prompt.

Pattern 2: General-Purpose Tool Definitions

The "Don't Break the Cache" paper recommends maintaining a fixed set of general-purpose, reusable tool definitions rather than dynamically generating tools per request. For capabilities that vary by session, implement them via code generation inside a stable execute_code tool rather than adding new tools:

// Bad: new tool per capability breaks cache
{"name": "fetch_user_john_data", "description": "..."}

// Good: stable tool, dynamic behavior via parameter
{"name": "execute_code", "description": "Execute Python code to accomplish tasks"}

Pattern 3: Cross-Session Cache Continuity

For 24/7 agents serving multiple users, the system prompt is the primary cross-session cache asset. Session-specific data (user ID, conversation history) lives only in the messages array. This means:

  • System prompt cache is shared across all sessions for the same model + same instructions
  • Adding a new user requires no cache warm-up for the system prompt portion
  • Per-session cost approaches the cache-read rate (0.1×) for the majority of tokens after the first few requests

Pattern 4: Multi-Turn Conversation Caching

For long multi-turn sessions, use automatic top-level cache_control and let the provider advance the breakpoint:

def chat(messages, client):
    response = client.messages.create(
        model="claude-sonnet-4-6",
        cache_control={"type": "ephemeral"},  # Advances automatically
        system=SYSTEM_PROMPT_BLOCKS,
        messages=messages
    )
    return response

The automatic breakpoint advances to the end of the conversation on each request, meaning each completed turn gets cached. If the user resends the same message (a common pattern in retries), the entire prior conversation is served from cache.

Pattern 5: Cache Monitoring and Alerting

Track the key metrics per request and aggregate for dashboards:

def log_cache_metrics(usage):
    total_input = (
        usage.cache_read_input_tokens +
        usage.cache_creation_input_tokens +
        usage.input_tokens
    )
    hit_rate = usage.cache_read_input_tokens / max(total_input, 1)

    metrics.record({
        "cache_hit_rate": hit_rate,
        "cache_write_tokens": usage.cache_creation_input_tokens,
        "cache_read_tokens": usage.cache_read_input_tokens,
        "uncached_tokens": usage.input_tokens,
        "estimated_savings": usage.cache_read_input_tokens * 0.9 * price_per_token
    })

    if hit_rate < 0.5:
        alert("Cache hit rate below 50% — check for dynamic prefix content")

Emerging Research and Open-Source Ecosystem

KVFlow (2025): Multi-Agent Workflow-Aware Caching

The KVFlow paper (arXiv 2507.07400) demonstrates that standard LRU eviction is fundamentally mismatched to multi-agent workflows. In an agent pipeline where agents execute in sequence, LRU tends to evict KV caches of upcoming agents (which have been idle) while retaining caches of recently completed agents (which won't run again soon). KVFlow replaces LRU with an Agent Step Graph that computes "steps-to-execution" for each agent and uses this to make eviction decisions. Combined with overlapped prefetching from CPU to GPU, it achieves:

  • 1.83× speedup over SGLang for large prompts in single workflows
  • 2.19× speedup in high-concurrency multi-agent deployments

This suggests that as multi-agent architectures mature, inference serving systems will need first-class workflow awareness rather than generic memory management.

vLLM Automatic Prefix Caching

vLLM's APC (enabled by default in recent releases) uses PagedAttention's block structure: each KV block covers a fixed token window and is addressed by a hash of its content and prefix position. Blocks are reference-counted; zero-reference blocks are LRU-evicted. This implementation is transparent to application developers and provides the same prefix caching benefits seen at the provider level for self-hosted deployments.

For teams running Claude, GPT, or Gemini via APIs, the provider handles this internally. For teams hosting Llama 3, Mistral, or other open models, configuring vLLM's APC is the equivalent optimization.
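For self-hosted stacks this typically reduces to a launch flag (model name illustrative; on recent vLLM releases APC is enabled by default, so the explicit flag mainly matters on older versions):

```shell
# Serve an open model with automatic prefix caching enabled
vllm serve meta-llama/Llama-3.1-8B-Instruct --enable-prefix-caching
```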

The COMPRESSION.md Protocol

The community-driven COMPRESSION.md specification (compression.md) proposes a standardized protocol for agent context compression. It defines semantic checkpoints, importance scoring for conversation turns, and interoperable compression signals that agents can emit to trigger summarization. As 24/7 agents become more common, standardized compression protocols reduce the need for custom implementations.


Implementation Checklist

For teams deploying long-running agent systems, the following checklist captures the high-priority items:

System Prompt Design

  • All static content (instructions, tool schemas) precedes all dynamic content
  • Tool definitions are sorted and deterministically serialized
  • No timestamps, session IDs, or user-specific data in the system prompt
  • Tool set is fixed for the session lifetime; no adding/removing tools mid-session
  • TTL chosen based on request frequency (5-min for >1 req/5min, 1-hour otherwise)

Context Management

  • Compression threshold set (recommend 75% of context window)
  • Verbatim recent turns preserved through compression (last 3–5 exchanges)
  • Compression summaries are deterministic and cached with cache_control
  • Tool results are not naively cached if pruning/summarization is active

Monitoring

  • Cache hit rate tracked per request and aggregated (target: >70%)
  • Cache write vs. read token ratio monitored for break-even verification
  • Alert configured for hit rate drops below 50%

Cross-Session Strategy

  • System prompt treated as a shared cache asset across all user sessions
  • User-specific context placed in messages array, not system prompt
  • Model switching avoided mid-session

Summary

Prompt caching in 2026 delivers genuine, production-measurable cost and latency improvements — but only when the prompt structure is designed to support it. The fundamental insight is that caching is a correctness problem masquerading as a performance problem: if any dynamic content breaks the prefix, caching does nothing. Treating the system prompt as an immutable, compiled artifact, managing context growth with explicit compression checkpoints, and monitoring cache hit rates as a first-class metric are the practices that separate 80% hit rates from near-zero ones. The 90% cost discount on cached input tokens effectively makes long, stable system prompts free after the first few requests — a structural cost advantage that compounds over the lifetime of a 24/7 agent deployment.

