Zylos
2026-04-05

AI Agent Memory Architectures: From Context Windows to Persistent Knowledge

ai-agents, memory-architecture, rag, context-management, knowledge-graphs, vector-databases

Executive Summary

Memory is the defining architectural challenge for production AI agents. A stateless LLM, however capable, cannot build relationships, learn from mistakes, or carry expertise across sessions without an explicit memory layer bolted on top. Over 2025 and into 2026, the agent ecosystem converged on a remarkably consistent three-tier taxonomy — episodic, semantic, and procedural memory — that mirrors decades of cognitive science research. What differs radically across frameworks is how they implement each tier, what storage backends they use, and how they balance the perpetual tension between in-context availability and out-of-context retrieval.

This survey covers the full stack: the theoretical taxonomy and its implementation in major frameworks (MemGPT/Letta, LangGraph, CrewAI, Mem0, Zep, Cognee), the mechanics of context window management and retrieval, the emerging pattern of declarative memory injection via CLAUDE.md and AGENTS.md files, and the open problems that the community has not yet solved. The picture that emerges is one of rapid consolidation: hybrid vector-graph stores are becoming the standard backend, benchmarks are maturing enough to differentiate real systems, and the security and privacy challenges — memory poisoning, GDPR compliance, stale facts — are finally getting the serious treatment they warrant.

The central practical conclusion is that no single storage paradigm dominates. Vector databases excel at fuzzy semantic recall but are structurally blind to relationships. Knowledge graphs handle relational and temporal reasoning with deterministic precision but demand ontology maintenance. Production-grade agents increasingly rely on hybrid architectures that layer both, with an LLM-managed interface that decides what to store, when to retrieve, and when to forget. The frameworks that have shipped this architecture at scale — Mem0, Zep/Graphiti, Letta — are pulling ahead of competitors that still treat memory as an afterthought.

A final dimension worth frontloading: evaluation is still immature. The two dominant benchmarks, LoCoMo and LongMemEval, test conversational recall over extended sessions, but neither captures procedural memory quality, cross-agent consistency, or resistance to poisoning. Until the evaluation infrastructure matures, production teams will continue to rely on application-level heuristics rather than principled comparisons.


Memory Taxonomy: Episodic, Semantic, and Procedural

The cognitive science literature distinguishes memory types by the kind of information stored and the mode of retrieval. AI agent frameworks have adopted this taxonomy with increasing fidelity.

Episodic Memory

Episodic memory records specific past events: what happened, when, in what context. In agent systems, this maps directly to conversation logs, tool-call traces, and interaction sequences. The key property is temporal ordering — episodic memories are not just facts, they are facts anchored to a moment in time.

Implementation varies widely. Letta's recall memory is a searchable SQLite or PostgreSQL log of all prior messages, pageable into the context window on demand. Zep's Graphiti engine stores episodic subgraphs with explicit bitemporal annotations: event time (when the fact was true in the world) and ingestion time (when the agent first observed it). This bitemporal design enables precise reasoning over retroactive corrections — if a user updates their address, Graphiti can distinguish the new fact from the old one without losing either.

The LoCoMo benchmark from Snap Research is the current standard for evaluating episodic memory. Each LoCoMo conversation spans up to 35 sessions, 300 turns, and 9,000 tokens, with questions categorized as single-hop, multi-hop, temporal, and open-domain. MAGMA (Multi-Graph based Agentic Memory Architecture, arxiv 2601.03236) achieved the highest LoCoMo judge score of 0.7 as of early 2026, outperforming MemoryOS (0.553), A-MEM (0.58), and Nemori (0.59). MAGMA stores episodic events in a dedicated event subgraph and uses cross-graph traversal to answer multi-hop temporal questions.

Semantic Memory

Semantic memory stores declarative facts and relationships: user preferences, entity properties, domain knowledge, and the structured knowledge that agents accumulate over time. Unlike episodic memory, semantic memory is largely atemporal — it represents what the agent believes to be currently true.

Mem0 is the most widely deployed semantic memory layer as of mid-2026, with approximately 48,000 GitHub stars and $24M in funding raised in October 2025. Its semantic layer extracts named entities and relationships from conversations via an LLM pipeline, stores them as nodes and edges in a graph database, and cross-links them to vector embeddings for fuzzy search. The conflict detection step — where new facts are compared against existing graph entries and merged, updated, or flagged for resolution — is the critical architectural innovation.
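The extract-and-reconcile step can be sketched in a few lines. This is an illustrative reduction, not Mem0's actual pipeline: a real system delegates conflict resolution to an LLM judge, where plain equality checks stand in here.

```python
from dataclasses import dataclass

@dataclass
class Fact:
    subject: str
    predicate: str
    obj: str

def reconcile(store, incoming: Fact) -> str:
    """Compare an extracted fact against existing entries and decide whether
    to ADD it, UPDATE a contradicted entry, or do nothing (NOOP)."""
    for i, existing in enumerate(store):
        if (existing.subject, existing.predicate) == (incoming.subject, incoming.predicate):
            if existing.obj == incoming.obj:
                return "NOOP"        # already known
            store[i] = incoming      # supersede the contradicted value
            return "UPDATE"
    store.append(incoming)
    return "ADD"

memory = [Fact("user", "prefers_language", "TypeScript")]
print(reconcile(memory, Fact("user", "prefers_language", "Python")))  # UPDATE
print(reconcile(memory, Fact("user", "timezone", "UTC+2")))           # ADD
```

The interesting design decision hides inside "UPDATE": whether to overwrite, version, or flag for human review is exactly where production systems diverge.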

LangMem (launched early 2025 as the official memory layer for LangGraph) supports semantic memory through its persistent store layer. Agents can read and write structured facts as JSON objects scoped to a user, session, or agent namespace, with automatic semantic deduplication. LangMem's p95 search latency of 59.82 seconds, however, makes it better suited for offline processing than real-time retrieval in latency-sensitive applications.

Cognee takes the most research-forward approach to semantic memory. Its cognify pipeline runs six stages: classify documents, check permissions, extract text chunks, run an LLM to extract (subject, relation, object) triplets, generate summaries, then embed into a vector store and commit edges to a knowledge graph. Its memify operation then refines the resulting graph — pruning stale nodes, reweighting edges based on usage frequency, and adding derived facts. The result is a self-improving knowledge structure, not a static index.

Procedural Memory

Procedural memory encodes how to do things: learned workflows, successful tool-call sequences, and behavioral heuristics. This is the most intellectually interesting and least mature memory tier.

The defining implementation challenge is that procedural memory in LLMs is best represented as instructions, not data. When an agent learns that "always validate JSON with schema X before calling API Y," that knowledge is most useful injected as a system-prompt directive, not retrieved from a vector store. This makes procedural memory architecturally different from the other two tiers.

Two patterns have emerged. The first is static procedural memory via markdown configuration files — CLAUDE.md, AGENTS.md, and .cursorrules — which encode learned conventions as human-readable instructions injected at the start of every session (covered in depth in the Emerging Patterns section). The second is dynamic procedural memory, where agents update their own system instructions at runtime. LangMem's SDK supports this explicitly: agents can call an update_system_prompt function that rewrites a designated memory block in their own context, encoding a newly learned heuristic for the rest of the session.
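A minimal sketch of the dynamic pattern follows. The update_system_prompt method here is illustrative and does not mirror LangMem's actual SDK surface; it only shows the shape of a rewritable procedural-memory block.

```python
class PromptMemory:
    """Sketch of a rewritable procedural-memory block embedded in the
    system prompt (illustrative, not a real SDK)."""

    def __init__(self, base_prompt: str):
        self.base_prompt = base_prompt
        self.heuristics = []

    def update_system_prompt(self, heuristic: str) -> None:
        # Deduplicate so a repeated lesson doesn't bloat the context window.
        if heuristic not in self.heuristics:
            self.heuristics.append(heuristic)

    def render(self) -> str:
        block = "\n".join(f"- {h}" for h in self.heuristics)
        return f"{self.base_prompt}\n\n## Learned heuristics\n{block}"

mem = PromptMemory("You are a deployment assistant.")
mem.update_system_prompt("Always validate JSON with schema X before calling API Y.")
print(mem.render())
```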

The research frontier here is represented by papers like MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory (January 2026) and MemEvolve: Meta-Evolution of Agent Memory Systems (December 2025), which explore learning procedural patterns from episodic traces without human curation.


Context Window Management vs. External Memory

The fundamental tension in agent memory design is between in-context availability and out-of-context retrieval. Information inside the context window is immediately accessible to the LLM with no latency and no retrieval error. Information in an external store must be retrieved, introducing latency, relevance error, and additional token cost.

The Cost of In-Context Memory

Full-context approaches — loading entire conversation histories into the context window — are operationally simple but economically unsustainable at scale. At $15 per million input tokens, a single call that resends a full 100,000-token history costs $1.50 in input tokens, and because the history is resent on every turn, a long session can run into tens of dollars. Full-context also does not scale to multi-session continuity: there is no context window large enough to hold months of interaction history.

The "infinite context" illusion deserves explicit attention. Several frontier models now support million-token context windows (Gemini 1.5 Pro, Claude 3's extended context tiers). This tempts teams to avoid memory architecture entirely and simply append history. Research consistently finds this approach degrades performance: LLMs exhibit a "lost in the middle" attention pattern where facts near the beginning and end of very long contexts are recalled more reliably than those in the middle. Long contexts also impose quadratic attention compute costs and make prompt caching less effective because the cache key changes with each new message.

Sliding Window and Summarization Chains

The simplest context management strategy is a sliding window: keep the N most recent turns in context and discard older ones. This is the default in most off-the-shelf chat interfaces and works adequately for short-horizon tasks. For agents with session continuity requirements, it fails as soon as the conversation exceeds N turns.

Summarization chains address this by compressing older turns into a rolling summary that is prepended to the context window. The implementation in practice is a two-buffer model: a raw message buffer holding the last K turns, and a summary buffer holding a compressed representation of everything before that. When the raw buffer overflows, an LLM summarization call compresses the oldest turns into the summary buffer.
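The two-buffer model can be sketched as follows. The summarize() function is a trivial stand-in for the LLM summarization call; the eviction logic is the part the sketch is meant to show.

```python
from collections import deque

def summarize(turns):
    """Stand-in for an LLM summarization call: truncate each evicted turn."""
    return "; ".join(t.split(":", 1)[1].strip()[:40] for t in turns)

class SummaryBuffer:
    """Two-buffer context manager: a raw buffer of the last K turns plus a
    rolling summary of everything older."""

    def __init__(self, k: int = 4):
        self.k = k
        self.raw = deque()
        self.summary = ""

    def add(self, turn: str) -> None:
        self.raw.append(turn)
        if len(self.raw) > self.k:
            evicted = self.raw.popleft()           # oldest turn overflows
            self.summary = (self.summary + " " + summarize([evicted])).strip()

    def context(self) -> str:
        return f"[summary] {self.summary}\n" + "\n".join(self.raw)

buf = SummaryBuffer(k=2)
for i in range(4):
    buf.add(f"user: message {i}")
print(buf.context())
```

A production variant would batch evictions and re-summarize the summary buffer itself once it grows too large, trading fidelity for token budget.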

"Observational memory" (described in VentureBeat coverage, 2025) extends this pattern with a dated-observation format: instead of free-text summaries, an Observer agent produces structured timestamped notes ([2025-11-14] User prefers Python over TypeScript for backend scripts). These observations are compressed enough to hold thousands in a single context window while being precise enough to reference in reasoning. The claimed cost reduction is 10x compared to full-context methods.

Hierarchical Memory Architecture

The production consensus in 2025-2026 is a three-tier hierarchy:

  1. In-context working memory — the current session's raw message buffer, tool calls, and active state. Ephemeral. Cleared on session end.
  2. Session-scoped compressed memory — summarized or extracted facts from the current session, readable on demand.
  3. Long-term persistent store — cross-session knowledge, user preferences, learned workflows, persisted in vector+graph storage.

Letta's architecture is the clearest implementation of this hierarchy. Core memory (always in-context, 2-4KB by default) holds the agent's current understanding of the user and the active task — equivalent to RAM. Archival memory is an external vector store with no size limit, searchable via embedding similarity. Recall memory is a conversation history log pageable in chunks. The LLM itself controls all three tiers via function calls: core_memory_replace, archival_memory_search, archival_memory_insert. This LLM-managed paging metaphor, inspired by OS virtual memory, is Letta's core innovation.
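The function-call interface can be sketched with in-process stand-ins. The tool names follow Letta's, but the bodies are illustrative: the real system pages against PostgreSQL and an embedding-indexed vector store, not Python lists with substring search.

```python
class TieredMemory:
    """Sketch of LLM-callable memory tools over a Letta-style tier split
    (illustrative storage, not Letta's implementation)."""

    def __init__(self):
        self.core = {"human": "", "persona": ""}   # always in-context, small
        self.archival = []                         # unbounded external store

    def core_memory_replace(self, block: str, old: str, new: str) -> None:
        self.core[block] = self.core[block].replace(old, new)

    def archival_memory_insert(self, text: str) -> None:
        self.archival.append(text)

    def archival_memory_search(self, query: str, top_k: int = 3):
        # Real systems rank by embedding similarity; substring match stands in.
        return [t for t in self.archival if query.lower() in t.lower()][:top_k]

mem = TieredMemory()
mem.core_memory_replace("human", "", "Name: Ada. Prefers concise answers.")
mem.archival_memory_insert("2026-01-12: user migrated backend to Postgres")
print(mem.archival_memory_search("postgres"))
```

The key property is that these functions are exposed to the model as tools, so paging decisions are themselves LLM outputs rather than framework heuristics.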

Selective Retrieval and RAG for Memory

Retrieval-Augmented Generation applied to agent memory means treating the external memory store as a knowledge base and fetching relevant memories before each LLM call. The retrieval query is typically the user's last message, the current task description, or a synthesized query derived from the conversation context.

The quality of this retrieval step determines the quality of the agent's memory access. Weaviate's 2025 "Context Engineering" framework distinguishes between memory layer retrieval (historical interactions), knowledge layer retrieval (domain facts), and working memory injection (current state). Each layer has different retrieval semantics — temporal proximity matters for memory, semantic relevance for knowledge, and freshness for working memory — and using the same retrieval strategy for all three is a common architectural mistake.


Production Memory Systems

Letta (formerly MemGPT)

Letta, the production version of the MemGPT research project, raised a $10M seed round from Felicis Ventures in September 2024 and has since become the reference implementation for LLM-managed tiered memory. Its three-tier model (core/archival/recall) maps cleanly to the episodic/semantic/procedural taxonomy and handles paging entirely through LLM function calls, giving the agent full autonomy over what to remember and what to forget.

Letta's architecture has a meaningful weakness that its own benchmarking blog acknowledges: LLM-managed memory operations add latency and token cost to every interaction where the model decides to page. For high-frequency interactions, this overhead accumulates.

LangGraph Checkpointing and LangMem

LangGraph (part of the LangChain ecosystem) addresses persistence through a checkpointing layer: every node execution produces a serializable state snapshot that can be written to PostgreSQL, SQLite, Redis, or MongoDB. Checkpoints enable interrupted workflow resumption, time-travel debugging (replaying from any prior state), and multi-session continuity.

The LangMem SDK, launched in early 2025, adds a higher-level memory abstraction on top of LangGraph's persistent store: named memory namespaces (user/session/agent), automatic semantic deduplication, and the procedural memory system-prompt update mechanism. MongoDB's integration with LangGraph (announced late 2025) provides Atlas Vector Search-backed long-term memory that persists across sessions with sub-100ms retrieval for typical agent workloads.

CrewAI Memory Architecture

CrewAI divides agent memory into four types: short-term memory (recent interactions within the current execution, backed by ChromaDB with local embedding), long-term memory (learnings from past task executions, stored in SQLite), entity memory (named entity extraction, stored in ChromaDB), and user memory (user-specific preferences, via Mem0 integration). The architecture is pragmatic rather than principled — each type uses a different backend with different retrieval semantics, and the developer is responsible for ensuring coherent retrieval across types.

Mem0

Mem0's April 2025 research paper (arxiv 2504.19413) formalizes its architecture: a two-phase pipeline (extraction via LLM, followed by conflict detection and graph update), a three-scope hierarchy (user/session/agent), and a hybrid backend combining vector similarity search with knowledge graph traversal. Performance claims include 26% higher response accuracy vs. OpenAI's native memory system on LoCoMo, 91% lower p95 latency, and 90% fewer tokens versus full-context methods.

The graph variant (Mem0g) adds entity relationship modeling: an Entity Extractor identifies nodes, a Relations Generator infers labeled edges, and a Conflict Detector flags overlapping entries for LLM-powered resolution. The 2% accuracy improvement over base Mem0 from adding the graph layer suggests that relationship modeling provides incremental but real benefits for complex queries.

Zep and Graphiti

Zep's January 2025 paper (arxiv 2501.13956) introduced Graphiti, a temporally-aware knowledge graph engine built on Neo4j. Graphiti's key innovation is bitemporal edge annotation: every relationship carries both event time (when the fact was true) and ingestion time (when the agent first observed it). This enables precise handling of contradictory or updated facts without information loss.
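A minimal sketch of bitemporal fact storage, illustrative rather than Graphiti's actual schema: superseded edges are invalidated rather than deleted, so both the audit history and the current value survive.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Edge:
    subject: str
    relation: str
    obj: str
    event_time: str                       # when the fact was true in the world
    ingestion_time: str                   # when the agent first observed it
    invalidated_at: Optional[str] = None  # set when a newer fact supersedes it

def current_value(edges, subject, relation):
    """Return the latest non-invalidated fact; superseded edges remain in
    the store for audit instead of being deleted."""
    live = [e for e in edges
            if e.subject == subject and e.relation == relation
            and e.invalidated_at is None]
    return max(live, key=lambda e: e.event_time, default=None)

edges = [
    Edge("ada", "address", "12 Elm St", "2024-03-01", "2024-03-01",
         invalidated_at="2026-01-10"),
    Edge("ada", "address", "9 Oak Ave", "2026-01-05", "2026-01-10"),
]
print(current_value(edges, "ada", "address").obj)  # 9 Oak Ave
```

Because event time and ingestion time are stored separately, the same store can also answer "what did the agent believe on a given date," which a single-timestamp design cannot.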

Zep's retrieval stack is notable for its latency profile: hybrid search combining semantic embeddings, BM25 keyword search, and graph traversal achieves P95 retrieval latency of 300ms with no LLM calls during retrieval. On the Deep Memory Retrieval (DMR) benchmark, Zep achieves 94.8% accuracy vs. MemGPT's 93.4%. Accuracy improvements of up to 18.5% with 90% latency reduction versus baseline implementations are reported in the paper.

Cognee

Cognee is an open-source knowledge engine (~12,000 GitHub stars as of early 2026) that takes a developer-first approach: its six-stage cognify pipeline and self-refining memify operation can be invoked in six lines of Python. Its 14 retrieval modes span from simple vector similarity to chain-of-thought graph traversal. Cognee's architecture explicitly separates session memory (short-term, in-context embeddings and graph fragments) from permanent memory (long-term knowledge artifacts cross-linked in the graph), and its self-improving memify cycle — pruning stale nodes, reweighting edges by usage frequency, adding derived facts — is closest in spirit to how biological memory consolidation actually works.


Embedding-Based vs. Structured Storage

Vector Databases

Vector databases (Pinecone, Weaviate, Chroma, Qdrant) store information as high-dimensional embedding vectors and retrieve by semantic similarity (cosine distance or dot product). They excel at unstructured fuzzy recall: "find everything related to database optimization" returns useful results even if the query does not match any stored text literally.

The structural limitation of vector databases is fundamental: vectors encode similarity but not structure. A query like "who manages the person who approved the API deployment?" requires traversing a relationship chain — Deployment → Approver → Manager — that has no natural representation in embedding space. The vector store will return semantically similar documents but cannot guarantee the correct relationship traversal. This limitation becomes critical as agent workloads grow more relational: multi-step reasoning, entity tracking, and provenance queries all stress vector retrieval beyond its design parameters.
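The difference is easy to see in miniature. A labeled-edge walk is deterministic in a way no similarity search can replicate (toy adjacency structure with hypothetical names):

```python
from typing import Optional

# Directed labeled edges as nested dicts: node -> {relation: target}
graph = {
    "deployment-42": {"approved_by": "carol"},
    "carol": {"managed_by": "dana"},
}

def traverse(start: str, path) -> Optional[str]:
    """Follow a fixed chain of labeled edges, returning None if any hop
    is missing. A vector store has no equivalent deterministic walk."""
    node = start
    for relation in path:
        node = graph.get(node, {}).get(relation)
        if node is None:
            return None
    return node

# "Who manages the person who approved the API deployment?"
print(traverse("deployment-42", ["approved_by", "managed_by"]))  # dana
```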

In practice, roughly 95% of RAG systems as of 2025 relied exclusively on vector embeddings (per MachineLearningMastery analysis). The gap between this baseline and hybrid architectures represents significant headroom for improvement.

Knowledge Graphs

Knowledge graphs (Neo4j, Amazon Neptune, TigerGraph) model information as typed nodes and directed labeled edges. Relationship queries are deterministic: the system traverses edges, not embedding space, which eliminates semantic drift and hallucination for structured queries. Neo4j's hosting of the Graphiti project (Zep's knowledge graph layer) reflects growing ecosystem alignment between graph databases and the agent memory space.

The implementation cost of knowledge graphs is higher than vector databases: developers must define ontologies, maintain schema coherence, and build or adapt entity extraction pipelines. For rapidly evolving agent domains, rigid ontologies can become maintenance burdens. Cognee's self-refining approach partially addresses this by allowing the graph schema to evolve via memify, but graph schema migration remains more complex than schema-less vector updates.

Hybrid Architectures

Research consistently shows that hybrid architectures combining vector and graph retrieval outperform either alone. A common production pattern:

  1. Semantic entry via vector similarity search to identify candidate nodes
  2. Graph traversal from those entry nodes to extract relational context
  3. Reranking by combining vector similarity scores with graph distance metrics
  4. Context assembly for the LLM
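The four steps above can be sketched end to end with toy vectors and a toy graph; the similarity function and rerank weights are illustrative, not any framework's defaults.

```python
import math

docs = {
    "n1": ([0.9, 0.1], "Postgres tuning notes"),
    "n2": ([0.8, 0.2], "Index design discussion"),
    "n3": ([0.1, 0.9], "Team offsite schedule"),
}
neighbors = {"n1": ["n2"], "n2": ["n1"], "n3": []}   # toy graph links

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))

def hybrid_retrieve(query_vec, top_k=2):
    # 1. Semantic entry: rank all nodes by vector similarity.
    scored = {n: cosine(query_vec, vec) for n, (vec, _) in docs.items()}
    entry = max(scored, key=scored.get)
    # 2. Graph traversal: pull one-hop neighbours of the entry node.
    candidates = {entry, *neighbors[entry]}
    # 3. Rerank: similarity plus a flat bonus for being the entry node.
    ranked = sorted(candidates,
                    key=lambda n: scored[n] + (0.2 if n == entry else 0.1),
                    reverse=True)
    # 4. Context assembly: return the texts in rank order.
    return [docs[n][1] for n in ranked[:top_k]]

print(hybrid_retrieve([1.0, 0.0]))
```

Note that the traversal step surfaces "Index design discussion" even when its raw similarity would not place it first — relational proximity, not embedding distance, pulls it into the context.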

HybridRAG (evaluated in a 2025 VLIZ paper) outperforms both pure VectorRAG and GraphRAG on retrieval accuracy and answer generation. Zep's implementation of this pattern in Graphiti — BM25 + embedding + graph traversal with no LLM calls at retrieval time — is the most production-validated example as of mid-2026.

Storage backend consolidation is an emerging trend. Rather than maintaining separate vector, graph, and relational databases, teams are moving toward unified platforms: PostgreSQL with pgvector for vector search plus standard SQL for structured queries, or MongoDB with Atlas Vector Search. Specialized memory layers like Mem0 and Zep abstract these backends behind a unified API, hiding polyglot persistence complexity from application code.


Emerging Patterns

Declarative Memory Injection: CLAUDE.md, AGENTS.md

The most widely adopted memory pattern in AI coding agents is also the simplest: a markdown file in the project root that is injected into the LLM's context at the start of every session. Claude Code reads CLAUDE.md; Codex reads AGENTS.md; Cursor reads .cursorrules. In 2025-2026, these formats are converging toward a cross-tool standard.

These files function as curated semantic and procedural memory: they record coding conventions, architectural decisions, preferred libraries, deployment checklist items, and any persistent instruction the team wants to propagate across sessions. The format is LLM-optimized: terse, declarative, no conversational filler.
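An illustrative fragment (hypothetical project and conventions) shows the register:

```markdown
# CLAUDE.md

## Conventions
- TypeScript strict mode; no `any` without a justifying comment.
- All DB access goes through `src/db/client.ts`; never import the driver directly.

## Deployment
- Run `npm run typecheck && npm test` before proposing a release PR.
- Staging deploys from `main`; production deploys only from tagged releases.
```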

The critical property that makes this pattern powerful is agent-writability. Claude Code's auto-memory system (visible from approximately v2.0.64, late 2025) allows Claude to write new entries back to CLAUDE.md autonomously during a session when it learns something useful — a correction, a new convention, a failure mode. The AutoDream system, introduced in February 2026, runs between sessions as a background sub-agent that reviews and consolidates memory files during idle time, analogous to REM sleep memory consolidation. This creates a compound learning loop: each session's learnings become the next session's prior knowledge.

Augment Code's 2026 guide recommends splitting AGENTS.md into subdirectories once it exceeds 150-200 lines, as token budget constraints make monolithic files counterproductive. Tools like Memorix (open source, mid-2025) extend the pattern further with a cross-agent MCP-based memory layer compatible with Cursor, Claude Code, Codex, Windsurf, and Gemini CLI — enabling shared procedural memory across different coding agents working on the same codebase.

Cross-Session Learning and Skill Acquisition

Beyond passive configuration files, the research frontier explores active cross-session learning: agents that not only read persistent memory but also improve it based on task outcomes. MemRL (January 2026) trains an agent to selectively write to episodic memory based on reinforcement signals — storing memories that led to successful outcomes and forgetting those that did not. MemEvolve (December 2025) takes a meta-learning approach, evolving the agent's entire memory management strategy rather than just the contents.

In production, cross-session learning most commonly manifests as structured feedback loops: a post-task summary agent extracts key learnings, human review gates whether they should be committed to long-term memory, and a memory-write agent persists approved entries. The human-in-the-loop gate is a pragmatic response to memory poisoning risk (see next section).

Multi-Agent Shared Memory

Multi-agent systems introduce coordination challenges that single-agent memory architectures are not designed to handle. LangGraph's approach — a centralized TypedDict state object with reducer functions controlling merge semantics — provides the simplest shared memory model. All agents read and write to the same state; reducer functions (typically last-write-wins or list-append) handle concurrent updates. MongoDB's LangGraph integration extends this to cross-session persistence, allowing multi-agent teams to resume work across sessions.
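The reducer-annotated state pattern can be sketched in plain Python. The merge() function here is an illustrative stand-in for LangGraph's internal state application, not its API; the Annotated-metadata convention matches the pattern described above.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints

class TeamState(TypedDict):
    findings: Annotated[list, operator.add]  # list-append reducer
    current_plan: str                        # no reducer: last write wins

def merge(state: dict, update: dict, schema) -> dict:
    """Apply one agent's partial update to shared state, using the reducer
    attached to each field via Annotated metadata."""
    hints = get_type_hints(schema, include_extras=True)
    out = dict(state)
    for key, value in update.items():
        reducers = getattr(hints[key], "__metadata__", ())
        out[key] = reducers[0](state[key], value) if reducers else value
    return out

state = {"findings": ["vector store chosen"], "current_plan": "draft"}
state = merge(state, {"findings": ["schema approved"], "current_plan": "v2"}, TeamState)
print(state["findings"])
```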

The research literature identifies three emerging multi-agent memory patterns: shared global memory (all agents have full read/write access), hierarchical memory (global team-level + role-specific + agent-private tiers), and isolated memory with message-passing (no shared state, explicit communication only). The EMNLP 2025 MemoryOS paper, which achieves F1 improvements of 49.11% on LoCoMo, uses a hierarchical three-tier model that closely mirrors LRU cache management — a sign that distributed systems memory management theory is transferring productively to agent architectures.

Memory-Aware Planning

Memory-aware planning uses retrieved past experience to inform future action selection. Rather than replanning from scratch each session, a memory-aware agent retrieves cases where it previously faced similar tasks, extracts what worked and what did not, and uses that experience to bias its planning decisions.

The most concrete implementation is case-based reasoning backed by episodic memory: at planning time, the agent retrieves the K most similar past tasks, reviews their outcomes, and generates a plan that adapts successful patterns and avoids documented failure modes. This is functionally equivalent to few-shot prompting with dynamically retrieved examples — the key difference being that the examples come from the agent's own history rather than a static curated dataset.
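A minimal sketch of this retrieval step, with token-overlap similarity standing in for embeddings and hypothetical task records:

```python
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

past_tasks = [
    {"desc": "deploy api service to staging", "outcome": "success",
     "lesson": "run schema validation before deploy"},
    {"desc": "migrate database to postgres", "outcome": "failure",
     "lesson": "snapshot before migration"},
    {"desc": "deploy frontend to production", "outcome": "success",
     "lesson": "invalidate CDN cache after deploy"},
]

def plan_context(task: str, k: int = 2) -> str:
    """Retrieve the K most similar past tasks and format them as few-shot
    planning context drawn from the agent's own history."""
    words = set(task.split())
    ranked = sorted(past_tasks,
                    key=lambda t: jaccard(words, set(t["desc"].split())),
                    reverse=True)
    return "\n".join(f"[{t['outcome']}] {t['desc']} -> {t['lesson']}"
                     for t in ranked[:k])

print(plan_context("deploy api gateway to production"))
```

Including failed cases alongside successes matters: documented failure modes are often the higher-value half of the retrieved context.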


Memory Consistency and Retrieval Challenges

Stale Memories

Stale memories are facts that were once true but are no longer valid. A user's employment status, a project's tech stack, a pricing tier — all are mutable facts that accumulate in agent memory and can mislead future reasoning if not updated.

Addressing staleness requires explicit temporal modeling. Zep's bitemporal annotations allow superseded facts to be preserved for audit purposes while flagged as non-current. Mem0's conflict detection pipeline compares new information against existing graph entries and resolves contradictions via LLM judgment. Time-to-live (TTL) policies on memory entries — effectively expiring memories after a configured period — provide a blunter but more reliable control: poisoned or stale memories cannot survive beyond their TTL.
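A TTL store takes only a few lines. The sketch below accepts an explicit now argument so expiry is deterministic rather than clock-dependent; that parameter is a testing convenience, not a claim about any framework's API.

```python
import time
from typing import Optional

class TTLMemory:
    """Memory store in which every entry expires after ttl_seconds,
    bounding how long a stale or poisoned entry can influence the agent."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.entries = []   # list of (written_at, text) pairs

    def write(self, text: str, now: Optional[float] = None) -> None:
        self.entries.append((time.time() if now is None else now, text))

    def read(self, now: Optional[float] = None):
        now = time.time() if now is None else now
        # Evict everything older than the TTL on every read.
        self.entries = [(t, s) for t, s in self.entries if now - t < self.ttl]
        return [s for _, s in self.entries]

mem = TTLMemory(ttl_seconds=30 * 24 * 3600)   # 30-day TTL
mem.write("user on free tier", now=0)
print(mem.read(now=45 * 24 * 3600))           # expired -> []
```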

Memory Poisoning Attacks

Memory poisoning is an adversarial attack where malicious content is injected into an agent's long-term memory store through ordinary interaction, causing the agent to behave incorrectly in future sessions. The attack surface is large: any user input, retrieved web content, or tool output that is summarized and persisted is a potential injection vector.

MINJA (Memory Injection Attack), documented by Dong et al. in 2025, achieves over 95% injection success rate and 70% attack success rate against agents with naive memory stores. The EchoLeak incident (2025) demonstrated real-world impact: a prompt hidden in an email caused an agent to leak private information from prior conversations.

LLM-based detection is insufficient alone: research by A-MemGuard finds that standalone analysis misses 66% of poisoned entries because malicious content appears benign in isolation — its harmful intent only manifests in combination with a specific triggering query. Effective defense requires layered controls: anomaly monitoring on memory write patterns, composite trust scoring combining temporal signals with content analysis, behavioral drift detection on agent outputs, and periodic audits that examine memories in context rather than individually.

Unit 42 (Palo Alto Networks, 2025) documented indirect prompt injection through agent memory as a high-confidence threat vector for enterprise AI deployments, elevating memory security from a research curiosity to a production concern.

Temporal Reasoning Over Memories

Even with accurate temporal annotations, reasoning over time-ordered memories is challenging for LLMs. Temporal question answering — "what did the user prefer before they changed their mind in November?" — requires the model to correctly apply bitemporal semantics, which is not reliably present in base LLMs without fine-tuning or structured prompting.

The LoCoMo benchmark specifically includes temporal question categories, and scores on these categories are consistently lower than single-hop factual retrieval across all evaluated systems. MAGMA's superior LoCoMo score (0.7 vs. 0.553 for MemoryOS) is partly attributed to its dedicated temporal reasoning component, which uses graph traversal over time-ordered episodic subgraphs rather than relying on LLM reasoning alone.


Open Problems

Memory Scalability

As agents accumulate months or years of history, memory stores grow in both size and retrieval complexity. Vector databases scale horizontally, but graph traversal complexity grows with edge count. The interplay between retrieval latency, accuracy, and store size has not been rigorously characterized for long-lived production agents. Mem0's 91% latency reduction claim is measured against full-context baselines, not against other memory systems at comparable store sizes.

Privacy, GDPR, and the Right to Forget

Agent memory stores containing personal data fall under GDPR's scope. Users have the right to access what an agent remembers (Article 15), correct inaccurate memories (Article 16), and demand erasure (Article 17). The EU AI Act, fully applicable from August 2026, adds a 10-year audit trail requirement for high-risk AI systems — directly conflicting with GDPR erasure rights.

GDPR's purpose limitation principle — data may only be used for the specific purpose for which it was collected — is structurally incompatible with agents designed to recombine stored facts dynamically across contexts. New America's OTI brief (2025) identified this as a fundamental legal gap: current privacy law frameworks were not designed for systems that synthesize personal data across sessions and use it to shape future behavior.

Memory systems with user-level namespace scoping (Mem0's user/session/agent hierarchy) are better positioned for GDPR compliance than session-less vector stores, as they at least enable targeted deletion. But purpose limitation and cross-context recombination remain unsolved policy problems.

Evaluation Metrics

The field lacks a comprehensive evaluation framework. LoCoMo and LongMemEval measure conversational recall; neither captures procedural memory quality, cross-agent consistency, temporal reasoning accuracy in isolation, or resistance to poisoning. The "Anatomy of Agentic Memory" paper (arxiv 2602.19320, February 2026) presents a taxonomy and empirical analysis of evaluation limitations, noting that current benchmarks reward verbatim recall over reasoning quality.

Letta's benchmarking blog makes the provocative argument that a simple filesystem is competitive with sophisticated memory frameworks for many agent workloads — a finding that suggests current benchmarks may not test the capabilities that production agents actually need.

The Infinite Context Illusion

As frontier models push toward million-token context windows, the case for external memory looks superficially weaker, but the underlying constraints remain. Lost-in-the-middle attention degradation, quadratic compute cost, prompt caching fragility, and economic unsustainability at scale all persist regardless of nominal context length. The practical conclusion from 2025 production deployments is that external memory is not a workaround for limited context windows: it is the right architectural choice for agents that need persistent knowledge, and longer context windows are best understood as expanded working memory for complex single-session tasks, not as a substitute for a proper memory system.


Conclusion

Agent memory architecture in 2026 is a maturing field with clear patterns and persistent gaps. The episodic/semantic/procedural taxonomy provides a useful design framework that most production systems have now adopted, even if their implementations vary. Hybrid vector-graph storage has become the recommended backend for complex workloads, with pure vector approaches remaining viable for simpler use cases. Declarative memory injection via CLAUDE.md and AGENTS.md has proven its value as a lightweight, maintainable form of procedural memory for coding agents.

The most pressing open problems are not architectural but operational: memory security (poisoning, injection), legal compliance (GDPR, EU AI Act), and evaluation quality. These are not purely technical problems — they require legal frameworks, security standards, and evaluation infrastructure that the broader ecosystem is only beginning to build. Frameworks that treat memory as a first-class operational concern — with provenance tracking, access control, TTL policies, and erasure support built in — will be better positioned for enterprise adoption than those that optimize for retrieval accuracy alone.

