Zylos
2026-05-30

Semantic Caching for AI Agents: Production Architecture for Cost and Latency Reduction

Tags: semantic-caching, cost-optimization, ai-agents, vector-search, llm-infrastructure, production

Executive Summary

AI agents are expensive to run. A production autonomous agent handling thousands of daily requests — planning steps, answering sub-questions, calling tools — rapidly accumulates token spend that scales linearly with usage. Semantic caching attacks this problem at the query level: instead of asking the model the same (or nearly the same) question twice, you serve a cached response based on vector similarity. Done right, this delivers 40–80% cost reduction and up to 15x latency improvement on cache hits, with no perceptible quality loss for suitable workloads.

This article covers the full production picture: how semantic caching works, where it fits in the agent architecture stack, how to tune it correctly, what breaks if you get it wrong, and which infrastructure options are mature enough for production in 2026.


How Semantic Caching Works

Classical caching uses exact key matching — you hash the input and look it up. This misses the fundamental characteristic of natural language: the same intent expressed differently produces different strings.

"What is the refund policy?" and "How do I get a refund?" are semantically equivalent. An exact-match cache sees them as unrelated keys. A semantic cache sees them as nearly the same question.

The mechanism:

  1. Incoming prompt is converted to a dense vector embedding (e.g., via OpenAI's text-embedding-3-small, Cohere's Embed, or a locally hosted model)
  2. The embedding is compared against a vector store of previously cached embeddings using cosine similarity (or dot product)
  3. If the nearest neighbor exceeds a configured similarity threshold, the cached response is returned
  4. Otherwise, the request proceeds to the LLM, and both the embedding and response are stored for future hits

The critical path on a cache hit: embedding generation (~10-30ms) + vector lookup (~1-5ms) = under 50ms total, versus 1–5 seconds for a live model call.
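The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a production design: the `SemanticCache` class, the `answer` helper, and the 0.95 default are assumptions made for the example, the `embed` function is whatever embedding model you plug in, and the linear scan stands in for a real vector index.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Optional

Vector = list[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class SemanticCache:
    embed: Callable[[str], Vector]   # assumed: any text -> vector function
    threshold: float = 0.95          # illustrative default, tune per workload
    entries: list[tuple[Vector, str]] = field(default_factory=list)

    def lookup(self, prompt: str) -> Optional[str]:
        """Steps 1-3: embed, find the nearest neighbor, apply the threshold."""
        query = self.embed(prompt)
        best_score, best_response = -1.0, None
        for vector, response in self.entries:
            score = cosine_similarity(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, prompt: str, response: str) -> None:
        self.entries.append((self.embed(prompt), response))


def answer(cache: SemanticCache, prompt: str, call_llm: Callable[[str], str]) -> str:
    """Serve from cache on a hit; otherwise call the model and store the result (step 4)."""
    cached = cache.lookup(prompt)
    if cached is not None:
        return cached
    response = call_llm(prompt)
    cache.store(prompt, response)
    return response
```

A real implementation would reuse the query embedding on a miss instead of embedding the prompt twice, and would replace the linear scan with an approximate nearest-neighbor index.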


The Three-Layer Caching Stack

In production agent systems, semantic caching is one layer in a multi-tier strategy. Each layer attacks a different problem:

┌─────────────────────────────────────────────────────────┐
│  Layer 3: Edge / CDN Response Cache                     │
│  → Deterministic outputs, URL-driven, sub-20ms          │
├─────────────────────────────────────────────────────────┤
│  Layer 2: Semantic Cache                                │
│  → Eliminates the LLM call entirely                     │
├─────────────────────────────────────────────────────────┤
│  Layer 1: Provider Prompt Cache                         │
│  → Cuts per-token cost on calls that do reach the LLM   │
└─────────────────────────────────────────────────────────┘

Layer 1 — Provider Prompt Caching: Anthropic, OpenAI, and Google all offer native prefix caching. Anthropic provides a 90% discount on cached input tokens via explicit cache_control: { type: "ephemeral" } markers; OpenAI auto-caches prefixes over 1024 tokens at 50% discount. This reduces the cost per call, but the call still happens.

Layer 2 — Semantic Caching: Eliminates the call entirely for repeated or paraphrased queries. Cache hits save 100% of inference cost (minus embedding overhead) and reduce latency from seconds to milliseconds.

Layer 3 — Edge Caching: For fully deterministic, URL-keyed outputs (autocomplete, article summaries, pre-computed recommendations), Cloudflare KV or Redis with short TTLs can serve responses with zero model involvement.

A real-world deployment combining all three layers reported going from $7,800/day to $515/day across 500K daily requests — a 93% cost reduction — with approximately two weeks of implementation effort.
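One way to picture how the three layers compose on a single request is the sketch below. It is only an illustration of the lookup order: `handle_request` and the three callables are hypothetical names, and Layer 1 is not visible in the code at all because provider prompt caching happens inside the model call.

```python
from typing import Callable, Optional

Lookup = Callable[[str], Optional[str]]


def handle_request(
    prompt: str,
    edge_lookup: Lookup,              # Layer 3: exact-match lookup (e.g. a KV store)
    semantic_lookup: Lookup,          # Layer 2: similarity-based lookup
    call_llm: Callable[[str], str],   # Layer 1 applies inside this call
) -> tuple[str, str]:
    """Return (response, source), where source names the layer that answered."""
    for source, lookup in (("edge", edge_lookup), ("semantic", semantic_lookup)):
        cached = lookup(prompt)
        if cached is not None:
            return cached, source
    # Both caches missed: pay for a model call, discounted by prefix caching
    return call_llm(prompt), "llm"
```

Returning the source alongside the response makes it straightforward to record per-layer hit rates later.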


Architecture for Agentic Systems

Agents introduce complications that simple chat applications don't face. Several agentic patterns interact poorly with naive semantic caching.

Where to Cache in an Agent Loop

An agent loop has multiple LLM invocations per user task: planning, tool selection, sub-question answering, result synthesis. Not all of these are cacheable.

User Request
     │
     ▼
┌─────────────────┐
│  Planning Step  │  ← HIGH cache potential (similar tasks yield same plan)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Tool Selection │  ← MEDIUM (context-dependent, but often repetitive)
└────────┬────────┘
         │
    ┌────┴────┐
    ▼         ▼
┌──────┐  ┌──────┐
│Tool 1│  │Tool 2│  ← DO NOT CACHE tool calls (side effects)
└──────┘  └──────┘
         │
         ▼
┌─────────────────┐
│  Synthesis Step │  ← LOW cache potential (unique inputs each time)
└─────────────────┘

Critical rule: Never cache tools/call operations. Tool invocations either have side effects (writing files, sending messages) or read live state (querying databases). A cached response would suppress the intended effect in the first case and return stale data in the second. Restrict semantic caching to idempotent read operations and reasoning steps.

In an MCP-based system, this maps directly: cache resources/read and prompts/get results, never tools/call results.
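A small allow-list is one way to enforce this rule in code. The sketch below is an assumption about how such a gate might look, not part of any MCP SDK: `is_cacheable` and `dispatch` are hypothetical helpers, and the cache here is a plain exact-match dict so that the example stays focused on the gating logic rather than similarity search.

```python
import json
from typing import Any, Callable

# Only these MCP methods are ever served from cache; everything else runs live
CACHEABLE_METHODS = frozenset({"resources/read", "prompts/get"})


def is_cacheable(method: str) -> bool:
    """Allow-list check: unknown methods and tools/call are never cached."""
    return method in CACHEABLE_METHODS


def dispatch(
    method: str,
    params: dict[str, Any],
    cache: dict[tuple[str, str], Any],
    execute: Callable[[str, dict[str, Any]], Any],
) -> Any:
    """Run a request, consulting the cache only for allow-listed methods."""
    if not is_cacheable(method):
        return execute(method, params)
    key = (method, json.dumps(params, sort_keys=True))
    if key not in cache:
        cache[key] = execute(method, params)
    return cache[key]
```

Using an allow-list rather than a deny-list means a newly added method defaults to "not cached" until someone opts it in.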

Sub-Query Caching

A powerful pattern for research and RAG agents: decompose the user query into sub-questions and apply semantic caching at the sub-question level. Reported results at well-tuned thresholds range up to 3 of every 4 sub-questions served from cache, with LLM calls falling from 8 to 4–6 per task and end-to-end latency dropping by roughly a third.

The 2026 paper "Semantic Caching and Intent-Driven Context Optimization for Multi-Agent Natural Language to Code Systems" (arXiv:2601.11687) demonstrates this at production scale: 67% cache hit rate on over 10,000 enterprise inventory management queries, with 94.3% semantic accuracy maintained. The key innovation is dual-threshold decision making — one threshold for exact retrieval, a lower threshold for "reference-guided" generation where the cached response is used as a template rather than returned verbatim.
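The dual-threshold idea can be expressed as a three-way decision. The sketch below is an interpretation of the description above, not the paper's implementation: the `CacheDecision` names and the 0.95 / 0.85 defaults are illustrative assumptions.

```python
from enum import Enum


class CacheDecision(Enum):
    RETURN_CACHED = "return_cached"        # serve the cached response verbatim
    REFERENCE_GUIDED = "reference_guided"  # pass the cached response to the LLM as a template
    MISS = "miss"                          # generate from scratch


def decide(
    similarity: float,
    exact_threshold: float = 0.95,      # assumed value, tune per workload
    reference_threshold: float = 0.85,  # assumed value, must not exceed exact_threshold
) -> CacheDecision:
    """Map a nearest-neighbor similarity score to one of three cache actions."""
    if reference_threshold > exact_threshold:
        raise ValueError("reference_threshold must not exceed exact_threshold")
    if similarity >= exact_threshold:
        return CacheDecision.RETURN_CACHED
    if similarity >= reference_threshold:
        return CacheDecision.REFERENCE_GUIDED
    return CacheDecision.MISS
```

The middle band still costs an LLM call, so it improves output consistency more than it reduces spend.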

Conversation History Problem

Long conversations are a trap. As conversation history grows, it dominates the embedding, making each turn's embedding look increasingly unique. A question asked at turn 20 won't match the same question asked at turn 3, even if the wording is identical.

The practical solution: configure a conversation_history_threshold (Bifrost defaults to 3 messages). Beyond that depth, choose one of:

  • Cache only the current user turn embedding, excluding history
  • Fall through to the LLM without caching
  • Segment the conversation and cache independently per segment
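The first two options can be sketched as a function that decides which text, if any, gets embedded for the cache lookup. `cache_text` and its `beyond` parameter are hypothetical names for this illustration and do not reflect Bifrost's internals; messages are assumed to be dicts with `role` and `content` keys.

```python
from typing import Optional


def cache_text(
    messages: list[dict[str, str]],
    history_threshold: int = 3,
    beyond: str = "current_turn",   # or "skip" to bypass the cache entirely
) -> Optional[str]:
    """Return the text to embed for cache lookup, or None to skip caching."""
    if len(messages) <= history_threshold:
        # Short conversation: the full history is still a usable cache key
        return "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    if beyond == "skip":
        return None
    # Long conversation: embed only the most recent user turn
    for message in reversed(messages):
        if message["role"] == "user":
            return message["content"]
    return None
```

Embedding only the current turn raises hit rates but discards context, so it suits questions that stand on their own; follow-ups like "and what about the second one?" are safer sent straight to the model.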

Similarity Threshold Tuning

The threshold is the most critical operational parameter. Set it too high and you get few hits; set it too low and you serve incorrect responses to distinct queries.

Use Case                       Recommended Threshold  Rationale
-----------------------------  ---------------------  ---------
Code generation                Do not cache           Small differences in requirements produce different correct code
Safety / moderation decisions  Do not cache           Risk of incorrect classification is too high
FAQ / customer support         0.94–0.97              High precision required; false hits cause wrong answers
Documentation Q&A              0.90–0.95              Moderate variation expected in phrasing
Internal knowledge search      0.88–0.92              Same team, shared vocabulary, lower risk
Classification / extraction    0.85–0.92              High repetition, bounded output space
General chat (low stakes)      0.85–0.90              Broad matching acceptable

Tuning process:

  1. Start at 0.95 and monitor hit rate vs. accuracy on a sample of queries
  2. Sample cache hits manually — are the served responses actually correct for the incoming query?
  3. Adjust threshold incrementally (0.01 steps) and re-sample
  4. Domain-specific embedding models (e.g., medical, legal, financial) allow higher thresholds because their embedding space is more semantically precise
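Steps 1 to 3 amount to a threshold sweep over manually labeled samples. The sketch below assumes you have collected, for each sampled query, the similarity of its nearest cached neighbor and a human judgment of whether the cached response would have been correct; `sweep_thresholds` is a hypothetical helper name.

```python
from typing import Optional


def sweep_thresholds(
    samples: list[tuple[float, bool]],   # (nearest-neighbor similarity, cached answer was correct)
    thresholds: list[float],
) -> list[tuple[float, float, Optional[float]]]:
    """For each threshold, return (threshold, hit_rate, precision of served hits)."""
    rows = []
    for threshold in thresholds:
        served = [correct for similarity, correct in samples if similarity >= threshold]
        hit_rate = len(served) / len(samples) if samples else 0.0
        # Precision is undefined when nothing would have been served
        precision = sum(served) / len(served) if served else None
        rows.append((threshold, hit_rate, precision))
    return rows


if __name__ == "__main__":
    labeled = [(0.97, True), (0.96, True), (0.93, False), (0.91, True)]
    candidates = [round(0.95 - 0.01 * step, 2) for step in range(6)]
    for threshold, hit_rate, precision in sweep_thresholds(labeled, candidates):
        print(f"{threshold:.2f}  hit_rate={hit_rate:.2f}  precision={precision}")
```

Pick the lowest threshold whose precision still meets the bar for the use case, then re-sample after deployment since the query distribution may drift.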

Tenant Isolation and Privacy

Multi-tenant agents must scope cache keys to prevent cross-tenant data leakage. The most dangerous pattern is a global semantic cache where one tenant's query inadvertently serves another tenant's cached private data.

Correct key scoping:

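# Naive (dangerous): one global cache shared by all tenants
# (embed and vector_store are placeholders for your own embedding and store clients)
hit = vector_store.search(embed(query))

# Still unsafe: a tenant prefix only nudges the embedding, so a near match
# from another tenant can still clear the similarity threshold
hit = vector_store.search(embed(f"[tenant:{tenant_id}] {query}"))

# Safer: hard isolation via a separate namespace (or metadata filter) per tenant
cache_namespace = f"semantic-cache:{tenant_id}"
hit = vector_store.search(embed(query), namespace=cache_namespace)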

The trade-off: tenant-scoped caches have lower hit rates (no cross-tenant sharing), but this is generally the right trade-off. An alternative is a shared public cache for non-sensitive universal queries (e.g., "What is the capital of France?") with a separate private cache per tenant for domain-specific queries — but this requires careful classification of which queries are safe to share.

MCP-based caching systems like mcp-refcache formalize this with explicit namespace scoping: public, session, user, organization, with granular permissions per namespace.
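To make the isolation property concrete, here is a minimal in-memory sketch in which lookups can only ever see entries stored under the same tenant's namespace. `TenantScopedCache` is a hypothetical class for illustration; it does not model mcp-refcache or any particular vector store, and callers are assumed to pass precomputed embedding vectors.

```python
import math
from collections import defaultdict
from typing import Optional

Vector = list[float]


class TenantScopedCache:
    """Semantic cache where each tenant's entries live in a separate namespace."""

    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self._namespaces: dict[str, list[tuple[Vector, str]]] = defaultdict(list)

    @staticmethod
    def _namespace(tenant_id: str) -> str:
        return f"semantic-cache:{tenant_id}"

    @staticmethod
    def _cosine(a: Vector, b: Vector) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def store(self, tenant_id: str, vector: Vector, response: str) -> None:
        self._namespaces[self._namespace(tenant_id)].append((vector, response))

    def lookup(self, tenant_id: str, vector: Vector) -> Optional[str]:
        """Search only this tenant's namespace; other tenants' entries are invisible."""
        best_score, best_response = self.threshold, None
        for cached_vector, response in self._namespaces.get(self._namespace(tenant_id), []):
            score = self._cosine(vector, cached_vector)
            if score >= best_score:
                best_score, best_response = score, response
        return best_response
```

Because the tenant boundary is applied before any similarity comparison, an identical query from another tenant is a guaranteed miss rather than a probable one.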


Infrastructure Options in 2026

Open Source Libraries

GPTCache (~8,000 GitHub stars): The most widely adopted open-source semantic caching library. Modular architecture supporting OpenAI, Hugging Face, and ONNX embeddings with pluggable vector backends (Milvus, FAISS, Chroma). A good fit when you want the cache built directly into application code rather than run as a separate gateway.

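from gptcache import cache
from gptcache.adapter import openai  # wraps the legacy (pre-1.0) OpenAI SDK interface
from gptcache.embedding import OpenAI as OpenAIEmbedding
from gptcache.manager import manager_factory
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# Embedding model used to vectorize prompts (expects OPENAI_API_KEY in the environment)
embedder = OpenAIEmbedding()

cache.init(
    embedding_func=embedder.to_embeddings,
    # SQLite stores cached responses, FAISS holds the vector index
    data_manager=manager_factory("sqlite,faiss", vector_params={"dimension": embedder.dimension}),
    similarity_evaluation=SearchDistanceEvaluation(max_distance=0.3),
)
cache.set_openai_key()

# Drop-in replacement: same API, cache-aware
response = openai.ChatCompletion.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the refund policy?"}]
)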

Gateway-Native Caching

Bifrost (~3,800 GitHub stars): High-performance Go LLM gateway with 11 microsecond overhead per request at 5,000 req/sec. Provides adapters for Weaviate, Redis/Valkey, Qdrant, and Pinecone. Per-request control via HTTP headers:

POST /v1/chat/completions
x-bf-cache-key: session-abc123
x-bf-cache-ttl: 1h
x-bf-cache-threshold: 0.92
x-bf-cache-type: semantic

LiteLLM: Popular multi-provider proxy now supports redis-semantic and qdrant-semantic cache backends with configurable thresholds. Integrates with existing LiteLLM deployments with minimal config changes.

Managed Cloud Services

AWS ElastiCache + Bedrock: AWS published research on 63,796 real chatbot queries demonstrating 86% cost reduction and 88% latency improvement at optimal thresholds, with 91% response accuracy maintained above 90% hit rate. Architecture: ElastiCache for Redis with vector search, connected to Bedrock's embedding models.

Maxim AI Gateway: Managed semantic caching with built-in analytics, hit rate dashboards, and threshold experimentation tooling. Lower operational overhead than self-hosted.


TTL Strategy

Cache TTL must match the volatility of the underlying data:

Content Type                           Recommended TTL
-------------------------------------  ---------------
Static documentation                   24h–7d
FAQ responses                          2h–24h
Pricing / policy (infrequent changes)  30min–2h
Live inventory / status                Skip semantic caching, or 30–60s
User session context                   Session-scoped (clear on logout)

When source knowledge is updated (new documentation deployed, policy changed), invalidation is critical. Options:

  • TTL expiry: Simple but stale during the window
  • Tag-based invalidation: Associate cached responses with source document IDs; invalidate on document update
  • Version stamping: Include a knowledge base version in the cache key; increment on updates to force full cache refresh
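Tag-based invalidation and version stamping can coexist in one store. The sketch below is a simplified illustration with hypothetical names (`InvalidatingCache`, `bump_version`); it uses exact-match keys for brevity, whereas a semantic cache would apply the version and tags as metadata filters on the vector search.

```python
from collections import defaultdict
from typing import Iterable, Optional


class InvalidatingCache:
    """Exact-match cache demonstrating tag-based and version-stamped invalidation."""

    def __init__(self, kb_version: int = 1) -> None:
        self.kb_version = kb_version
        self._entries: dict[tuple[int, str], str] = {}
        self._keys_by_doc: dict[str, set[tuple[int, str]]] = defaultdict(set)

    def put(self, key: str, response: str, source_doc_ids: Iterable[str] = ()) -> None:
        full_key = (self.kb_version, key)   # version is part of the key
        self._entries[full_key] = response
        for doc_id in source_doc_ids:
            self._keys_by_doc[doc_id].add(full_key)

    def get(self, key: str) -> Optional[str]:
        return self._entries.get((self.kb_version, key))

    def invalidate_document(self, doc_id: str) -> int:
        """Tag-based: drop every entry derived from one source document."""
        stale_keys = self._keys_by_doc.pop(doc_id, set())
        for full_key in stale_keys:
            self._entries.pop(full_key, None)
        return len(stale_keys)

    def bump_version(self) -> None:
        """Version stamping: entries under the old version become unreachable."""
        self.kb_version += 1
```

Note that `bump_version` only hides old entries; they still occupy memory until TTL or eviction removes them.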

Anti-Patterns to Avoid

1. Embedding timestamps or request IDs in system prompts

from uuid import uuid4

# Broken: every request has a unique embedding
system_prompt = f"You are an assistant. Request ID: {uuid4()}"

This makes every embedding unique, defeating the cache entirely. Keep system prompts static and stable.

2. Non-deterministic tool schema ordering

If tool schemas are assembled dynamically and their order varies, the prompt varies, and the embedding shifts even when the semantic content is identical. Sort tools deterministically.
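A deterministic serialization takes only a few lines. The sketch below assumes each tool schema is a dict with a unique `name` key; `canonical_tools` is a hypothetical helper name.

```python
import json
from typing import Any


def canonical_tools(tools: list[dict[str, Any]]) -> str:
    """Serialize tool schemas so that assembly order never changes the prompt text."""
    ordered = sorted(tools, key=lambda tool: tool["name"])
    # sort_keys also stabilizes key order inside each schema
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"))
```

Two agents that register the same tools in a different order now produce byte-identical prompt fragments, which helps both the semantic cache and provider-side prefix caching.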

3. No monitoring

Cache hit rates should be tracked as a first-class metric. A hit rate drop from 65% to 30% is a signal that your query distribution has shifted or that TTL tuning is needed — not something you want to discover in a cost report weeks later.
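A rolling hit-rate tracker is enough to catch this kind of drop early. The sketch below is a minimal in-process version with hypothetical names and an assumed alert level of 40%; in practice the same counters would usually be exported to an existing metrics system.

```python
from collections import deque
from typing import Optional


class HitRateMonitor:
    """Tracks cache hit rate over the most recent `window` lookups."""

    def __init__(self, window: int = 1000, alert_below: float = 0.4) -> None:
        self._events: deque[bool] = deque(maxlen=window)
        self.alert_below = alert_below   # assumed alert level, set from your own baseline

    def record(self, hit: bool) -> None:
        self._events.append(hit)

    @property
    def hit_rate(self) -> Optional[float]:
        if not self._events:
            return None
        return sum(self._events) / len(self._events)

    def should_alert(self) -> bool:
        """Alert only once the window is full, to avoid noise during warm-up."""
        window_full = len(self._events) == self._events.maxlen
        rate = self.hit_rate
        return window_full and rate is not None and rate < self.alert_below
```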

4. Missing eviction policy

Vector stores without eviction grow unboundedly. Implement LRU or TTL-based eviction; for Redis, configure maxmemory-policy allkeys-lru.
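For stores that lack built-in eviction, LRU can be layered on in application code. The sketch below uses Python's `OrderedDict` and a hypothetical `LRUStore` class; it bounds the number of entries rather than memory, which is a simplification.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional


class LRUStore:
    """Bounded key-value store that evicts the least recently used entry."""

    def __init__(self, max_entries: int) -> None:
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self.max_entries = max_entries
        self._data: OrderedDict[Hashable, Any] = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)   # mark as most recently used
        return self._data[key]

    def put(self, key: Hashable, value: Any) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)   # drop the least recently used entry

    def __len__(self) -> int:
        return len(self._data)
```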

5. Loose thresholds for high-stakes domains

A 0.85 threshold on a medical symptom checker is dangerous. "Is chest pain serious?" and "Is stomach pain serious?" might share enough semantic content to hit the cache — with potentially harmful results. Default to "no caching" and opt in explicitly.


Measuring ROI

A simple model for cache ROI:

daily_cost_savings = hit_rate × daily_requests × avg_cost_per_request
                   - daily_requests × embedding_cost_per_request
                   - infrastructure_cost_per_day

break_even_hit_rate = (infrastructure_cost_per_day + daily_requests × embedding_cost_per_request)
                    / (daily_requests × avg_cost_per_request)

At typical production numbers (100K daily requests, $0.05/request, $0.0001/embedding, $50/day for Redis):

  • At 30% hit rate: $1,440/day savings ($1,500 avoided, minus $10 embeddings and $50 infrastructure)
  • At 50% hit rate: $2,440/day savings (~$891K/year)
  • At 70% hit rate: $3,440/day savings

The embedding cost is typically negligible: the $0.0001/embedding assumed above adds $10/day at 100K requests, and OpenAI's text-embedding-3-small at ~$0.00002/request would add only $2/day. The dominant overhead is infrastructure (vector store, Redis), which is fixed regardless of request volume.
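The ROI model can be written as two small functions, which makes it easy to plug in your own numbers. The function names are hypothetical; the parameters mirror the terms of the formula, with embedding cost charged on every request because the embedding is computed on hits and misses alike.

```python
def daily_savings(
    hit_rate: float,
    daily_requests: int,
    avg_cost_per_request: float,
    embedding_cost_per_request: float,
    infrastructure_cost_per_day: float,
) -> float:
    """Net daily savings: avoided LLM spend minus embedding and infrastructure costs."""
    avoided = hit_rate * daily_requests * avg_cost_per_request
    embedding = daily_requests * embedding_cost_per_request
    return avoided - embedding - infrastructure_cost_per_day


def break_even_hit_rate(
    daily_requests: int,
    avg_cost_per_request: float,
    embedding_cost_per_request: float,
    infrastructure_cost_per_day: float,
) -> float:
    """Hit rate at which daily savings are exactly zero."""
    overhead = infrastructure_cost_per_day + daily_requests * embedding_cost_per_request
    return overhead / (daily_requests * avg_cost_per_request)


if __name__ == "__main__":
    for rate in (0.3, 0.5, 0.7):
        print(rate, round(daily_savings(rate, 100_000, 0.05, 0.0001, 50.0), 2))
    print("break-even:", break_even_hit_rate(100_000, 0.05, 0.0001, 50.0))
```

With the example inputs, the break-even hit rate is a little over 1%, which is why the fixed overhead rarely decides whether caching is worthwhile at this scale.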


Key Implementation Checklist

  • Identify cacheable vs. non-cacheable steps in your agent loop (never cache tool calls)
  • Choose embedding model appropriate for your domain vocabulary
  • Configure per-tenant namespace scoping to prevent cross-tenant leakage
  • Set initial similarity threshold at 0.95, tune down from measured samples
  • Implement TTL policy matched to content volatility
  • Add cache invalidation hooks to your knowledge base update pipeline
  • Instrument hit rate, miss rate, and latency as first-class metrics
  • Configure eviction policy on the vector store
  • Set conversation_history_threshold for multi-turn agents
  • Test for false hits by sampling cache-served responses against ground truth

Conclusion

Semantic caching is one of the highest-leverage optimizations available for production AI agent systems. The economics are compelling: at a 50% hit rate, the mid-scale deployment modeled above saves roughly $0.9M/year. The tooling is mature: GPTCache, Bifrost, LiteLLM, and managed cloud options all provide production-ready implementations. The main engineering investment is in tuning (threshold selection), isolation (tenant scoping), and observability (hit rate monitoring).

The important constraint is scope: semantic caching works well for read-heavy, query-repetitive workloads — FAQ, documentation, sub-question answering, classification. It should be explicitly disabled for tool calls, code generation with precise requirements, and any safety-sensitive decision path. Instrument first, then cache — the hit rate data will tell you where caching pays.


References