Episodic Memory and Narrative Coherence in AI Agents

Executive Summary

Persistent AI agents face a fundamental tension: context windows reset, yet the agent must behave as if it remembers everything. Semantic (fact-based) memory solves half the problem — it captures what an agent knows. Episodic memory captures what happened, in order, with context — allowing agents to reason about the arc of a relationship, not just isolated facts. Research in 2025–2026 has converged on three complementary advances: (1) multi-graph architectures that separate temporal, causal, semantic, and entity dimensions of memory; (2) biologically-inspired sleep-consolidation pipelines that compact raw experience into durable knowledge; and (3) production benchmarks that expose cross-session coherence as the critical unsolved frontier. This report covers the architectures, benchmarks, and engineering trade-offs.

The Episodic Memory Gap

Standard LLM agents store semantic facts ("user prefers dark mode") but not episodes ("on March 4 the user switched from light to dark mode because the new dashboard felt cluttered"). The difference matters for three classes of agent behavior:

Temporal reasoning: "Why did X happen?" and "When did you last do Y?" questions require sequenced event records, not static facts. Current SOTA models score between 0.204 and 0.290 on Chronological Awareness benchmarks — meaning agents often mis-sequence events they correctly recall as facts.

Narrative continuity: Long-running agents — personal assistants, coding partners, project managers — accumulate a history of a relationship. Losing the episodic arc causes agents to treat a recurring pattern as novel each session, repeat questions users have already answered, and miss causal connections that span weeks.

Adaptive behavior: Agents that remember outcomes of past tool calls, failed strategies, or user corrections can self-improve within their operational lifetime, without any model fine-tuning.

A February 2025 position paper ("Episodic Memory is the Missing Piece for Long-Term LLM Agents") identifies five required properties: (1) long-term persistence beyond the session, (2) explicit temporal reasoning over stored events, (3) single-shot learning from rare experiences, (4) instance-specific (not generalized) recall, and (5) contextual grounding that encodes when and how an event occurred.
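As a rough illustration of those five properties, the sketch below stores each event once, with its time and context, and answers a simple "when did this last happen?" query. The `Episode` and `EpisodeLog` names, the fields, and the example dates are assumptions made for illustration; they do not come from the position paper.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Episode:
    """One instance-specific event, stored after a single occurrence."""
    occurred_at: datetime  # when it happened (temporal grounding)
    summary: str           # what happened
    context: str           # how or why it happened
    session_id: str        # session it came from; the record outlives it


class EpisodeLog:
    """Hypothetical in-memory log kept in chronological order."""

    def __init__(self):
        self._episodes = []

    def record(self, episode):
        self._episodes.append(episode)
        self._episodes.sort(key=lambda e: e.occurred_at)

    def last_matching(self, keyword):
        """Return the most recent episode whose summary mentions keyword."""
        for episode in reversed(self._episodes):
            if keyword.lower() in episode.summary.lower():
                return episode
        return None


log = EpisodeLog()
# The year is invented; the source example only says "March 4".
log.record(Episode(datetime(2025, 3, 4), "User switched from light to dark mode",
                   "new dashboard felt cluttered", "s12"))
log.record(Episode(datetime(2025, 1, 10), "User enabled light mode",
                   "default during onboarding", "s01"))
latest = log.last_matching("mode")
```

A semantic store would keep only "user prefers dark mode"; the episode record also keeps the date and the reason, which is what temporal and causal questions need.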

The Benchmark Landscape

Three benchmarks now define the measurement space:

LoCoMo (Snap Research) is the primary episodic benchmark: conversations spanning up to 35 sessions and averaging roughly 300 turns and 9,000 tokens. Questions fall into five categories: single-hop, multi-hop, temporal, open-domain, and adversarial. Answers are scored from 0 to 1 by an LLM-as-judge metric. As of early 2026, top systems score around 0.700; full-context approaches, which place the entire history in the prompt, score 0.481.

LongMemEval (500 questions) emphasizes knowledge updates and multi-session tasks, measuring whether agents correctly recognize when a previously stored fact has become stale. Systems with a strong consolidation layer outperform raw-retrieval approaches by 5–10 points here.

BEAM scales to 1M–10M tokens and is designed for production-grade deployments. Performance on BEAM drops roughly 25% as memory grows from 1M to 10M tokens — a structural cliff that signals unsolved temporal abstraction challenges.

The strongest reported results reach 92.5 on LoCoMo and 94.4 on LongMemEval while consuming only ~6,900 tokens per query, compared to the 26,000-token baseline of full-context approaches. These two scores are on a 0 to 100 scale and are not directly comparable to the 0 to 1 judge scores quoted above. The efficiency gain is as important as the accuracy gain: at production scale, a roughly 3.8x token reduction compounds into significant cost and latency savings.
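The token figures above can be checked with a few lines of arithmetic. The per-query numbers come from the text; the daily query volume is a hypothetical workload chosen only to show how the saving compounds.

```python
full_context_tokens = 26_000   # per-query baseline quoted above
memory_system_tokens = 6_900   # approximate per-query usage quoted above

reduction_factor = full_context_tokens / memory_system_tokens
tokens_saved_per_query = full_context_tokens - memory_system_tokens

# Hypothetical workload, not a figure from any benchmark.
queries_per_day = 1_000_000
tokens_saved_per_day = tokens_saved_per_query * queries_per_day

print(f"reduction factor: {reduction_factor:.2f}x")
print(f"tokens saved per day: {tokens_saved_per_day:,}")
```

The ratio works out to a little under 3.8, which is where the rounded "4x" comes from.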

MAGMA: Multi-Graph Agentic Memory Architecture

MAGMA (January 2026) represents the most architecturally rigorous approach to episodic memory. Rather than a single vector store, it maintains four orthogonal graph views over the same event nodes:

Temporal graph: Strictly ordered chronological chains. Each event node carries a timestamp; edges encode "happened before" relationships. This is the backbone for all temporal reasoning.

Causal graph: Directed entailment edges inferred by a background LLM process. "User's recurring deadline stress (March) caused the request for automated weekly planning (April)" is an example causal edge that would not appear in a semantic similarity search.

Semantic graph: Undirected edges connecting conceptually similar events via embedding cosine similarity above a threshold. Enables retrieval of thematically related events even when they are far apart in time.

Entity graph: Events linked to abstract entity nodes, solving the object permanence problem — tracking how a user's project, a software component, or a relationship evolves across disjoint timelines.
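One way to picture four views over the same event nodes is a single node table with a separate adjacency map per view. The sketch below is an assumption-laden simplification, not MAGMA's actual data model: the class and method names are hypothetical, embeddings are omitted, events are assumed to arrive in chronological order, and causal and semantic links are added by hand rather than inferred.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class EventNode:
    node_id: str
    content: str
    timestamp: float
    entities: list = field(default_factory=list)


class MultiGraphMemory:
    VIEWS = ("temporal", "causal", "semantic", "entity")

    def __init__(self):
        self.nodes = {}
        # One adjacency map per view, all pointing at the same node ids.
        self.edges = {view: defaultdict(set) for view in self.VIEWS}
        self._last_id = None

    def add_event(self, node):
        """Store the node and extend the temporal and entity views."""
        self.nodes[node.node_id] = node
        if self._last_id is not None:
            # "happened before" edge from the previous event
            self.edges["temporal"][self._last_id].add(node.node_id)
        self._last_id = node.node_id
        for entity in node.entities:
            self.edges["entity"][f"entity:{entity}"].add(node.node_id)

    def link(self, view, src, dst):
        """Add an edge in one view; semantic edges are undirected."""
        self.edges[view][src].add(dst)
        if view == "semantic":
            self.edges[view][dst].add(src)

    def neighbors(self, view, node_id):
        return sorted(self.edges[view].get(node_id, set()))


mem = MultiGraphMemory()
mem.add_event(EventNode("e1", "Recurring deadline stress", 1.0, ["user"]))
mem.add_event(EventNode("e2", "Asked for automated weekly planning", 2.0,
                        ["user", "planner"]))
mem.link("causal", "e1", "e2")
```

Because every view indexes the same node ids, a traversal can switch views mid-walk without copying event data.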

Each event node stores: content, timestamp, dense embedding, and structured metadata. The query pipeline is intent-aware — it classifies the query type (Why / When / Entity), then steers traversal to weight causal edges for "why" questions and temporal edges for "when" questions. A Reciprocal Rank Fusion layer combines vector search, keyword matching, and temporal filtering to identify anchor nodes before traversal. Narrative synthesis then topologically orders the retrieved subgraph with provenance timestamps before injecting it into the context.
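The anchor-selection step can be sketched with the standard Reciprocal Rank Fusion formula, in which each ranked list contributes 1 / (k + rank) to a node's score. The constant k = 60 is the value commonly used for RRF, not a setting confirmed for MAGMA, and the three hit lists are made-up inputs.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of node ids into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, node_id in enumerate(ranking, start=1):
            scores[node_id] += 1.0 / (k + rank)
    # Highest fused score first; node id breaks ties deterministically.
    return sorted(scores, key=lambda node_id: (-scores[node_id], node_id))


# Hypothetical results from the three retrievers named in the text.
vector_hits = ["e3", "e1", "e7"]
keyword_hits = ["e1", "e3", "e9"]
temporal_hits = ["e1", "e7", "e3"]

anchors = reciprocal_rank_fusion([vector_hits, keyword_hits, temporal_hits])
```

RRF needs only ranks, not comparable scores, which is why it suits combining a vector index, a keyword index, and a time filter.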

The dual-stream write architecture separates latency-sensitive ingestion (fast path: event segmentation, vector indexing, temporal backbone update, no LLM calls) from asynchronous structural consolidation (slow path: LLM-based causal and entity inference running in background). This ensures agent responsiveness is not blocked by memory writes.
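A minimal sketch of the two write paths follows, assuming a single process and a plain queue. The `DualStreamWriter` class is hypothetical, and `stub_infer_links` merely stands in for the LLM-based inference step by linking events that share a word; a real slow path would run in a background worker.

```python
from collections import deque


def stub_infer_links(event, timeline):
    """Placeholder for an LLM call: link to earlier events sharing a word."""
    words = set(event.lower().split())
    position = timeline.index(event)
    return [(earlier, event) for earlier in timeline[:position]
            if words & set(earlier.lower().split())]


class DualStreamWriter:
    def __init__(self, infer_links):
        self.timeline = []      # fast path output: temporal backbone
        self.pending = deque()  # events waiting for the slow path
        self.links = []         # slow path output: inferred edges
        self._infer_links = infer_links

    def ingest(self, event):
        """Fast path: cheap and synchronous, no model calls."""
        self.timeline.append(event)
        self.pending.append(event)

    def consolidate(self, max_events=None):
        """Slow path: drain the queue; safe to call between interactions."""
        processed = 0
        while self.pending and (max_events is None or processed < max_events):
            event = self.pending.popleft()
            self.links.extend(self._infer_links(event, self.timeline))
            processed += 1
        return processed


writer = DualStreamWriter(stub_infer_links)
for text in ["deadline stress", "weekly planning request", "deadline moved"]:
    writer.ingest(text)
processed = writer.consolidate()
```

The point of the split is that `ingest` returns immediately, while `consolidate` can be scheduled whenever the agent is idle.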

LoCoMo results: MAGMA scores 0.700, outperforming Nemori (0.590), A-MEM (0.580), MemoryOS (0.553), and full context (0.481). It is particularly strong on adversarial questions (0.742) and temporal reasoning (0.650). Ablation studies show that removing the adaptive traversal policy causes the largest single drop (0.700 → 0.637), which indicates that intent-aware routing, not just richer graph structure, is the key differentiator.
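The traversal policy that the ablation singles out can be pictured as a lookup from query intent to per-edge-type weights. The sketch below is illustrative only: the keyword classifier, the weight values, and the function names are assumptions, and a real system would presumably classify intent with a model rather than a prefix check.

```python
# Hypothetical weights: how much each edge type counts for each intent.
EDGE_WEIGHTS = {
    "why":    {"causal": 1.0, "temporal": 0.3, "semantic": 0.2, "entity": 0.2},
    "when":   {"causal": 0.3, "temporal": 1.0, "semantic": 0.2, "entity": 0.2},
    "entity": {"causal": 0.2, "temporal": 0.3, "semantic": 0.4, "entity": 1.0},
}


def classify_intent(query):
    """Crude stand-in for an intent classifier."""
    q = query.strip().lower()
    if q.startswith("why"):
        return "why"
    if q.startswith("when"):
        return "when"
    return "entity"


def rank_neighbors(query, neighbors):
    """Order candidate edges (node_id, edge_type, strength) for traversal."""
    weights = EDGE_WEIGHTS[classify_intent(query)]
    ranked = sorted(neighbors, key=lambda n: weights[n[1]] * n[2], reverse=True)
    return [node_id for node_id, _, _ in ranked]


candidates = [
    ("e_stress", "causal", 0.6),
    ("e_yesterday", "temporal", 0.9),
    ("e_similar", "semantic", 0.8),
]
why_order = rank_neighbors("Why did the user ask for planning?", candidates)
when_order = rank_neighbors("When did the user last mention deadlines?", candidates)
```

The same candidate edges are visited in a different order depending on the question, which is the behavior pure vector similarity cannot provide.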

Token efficiency: On LongMemEval at 100K-token contexts, MAGMA cuts token usage by more than 95% (0.7K to 4.2K tokens per query vs. 100K) while exceeding full-context accuracy (61.2% vs. 55.0%).

HeLa-Mem: Hebbian Learning for Associative Recall

HeLa-Mem (April 2026) takes a different angle, focusing on relationship continuity — preserving the narrative arc of an evolving human-agent relationship rather than just temporal event ordering.

The core insight is Hebbian plasticity: "neurons that fire together, wire together." Conversation turns become nodes in a dynamic graph; edges strengthen each time two memories co-activate during retrieval. Over time, the graph develops hub nodes representing high-frequency association clusters — the agent's equivalent of well-worn mental pathways.

A meta-cognitive reflective agent runs periodically to identify hub nodes and apply "Hebbian Distillation": transforming episodic detail-clusters into structured semantic knowledge, analogous to how human episodic memories degrade in detail but strengthen in gist. Isolated, unused nodes are pruned by adaptive forgetting to prevent graph explosion.

Retrieval combines semantic similarity search with spreading activation — following learned Hebbian edges to retrieve memories that are semantically distant but strongly associated through accumulated experience.
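The Hebbian update and a single hop of spreading activation can be sketched as follows. The learning rate, the activation threshold, and the `HebbianGraph` class itself are assumptions for illustration; HeLa-Mem's actual update rule and traversal depth may differ.

```python
from collections import defaultdict
from itertools import combinations


class HebbianGraph:
    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = defaultdict(float)  # (id_a, id_b) with id_a < id_b

    def co_activate(self, retrieved_ids):
        """Strengthen every pair of memories retrieved together."""
        for pair in combinations(sorted(set(retrieved_ids)), 2):
            self.weights[pair] += self.learning_rate

    def weight(self, a, b):
        return self.weights.get(tuple(sorted((a, b))), 0.0)

    def spread(self, seeds, threshold=0.15):
        """One hop: add neighbors joined to a seed by a strong enough edge."""
        seed_set = set(seeds)
        activated = set(seed_set)
        for (a, b), w in self.weights.items():
            if w >= threshold:
                if a in seed_set:
                    activated.add(b)
                if b in seed_set:
                    activated.add(a)
        return sorted(activated)


graph = HebbianGraph()
graph.co_activate(["m1", "m2"])
graph.co_activate(["m1", "m2", "m3"])
recalled = graph.spread(["m1"])
```

Here m2 is recalled alongside m1 because the pair has co-activated twice, while m3 stays below the threshold after a single co-activation.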

On LoCoMo, HeLa-Mem achieves the best average rank of 1.25 across question categories (vs. MemoryOS at 2.25). Token efficiency is striking: ~1,010 tokens per query vs. 16,910 for full context. Multi-hop reasoning reaches 40.14% F1, and results hold consistently across four LLM backbones (GPT-4o-mini, GPT-4o, Qwen2.5 variants).

Sleep-Consolidated Memory: Biologically-Inspired Compaction

The SCM system (April 2026) implements a five-module architecture directly modeled on the brain's hippocampal–neocortical memory consolidation during sleep:

MeaningEncoder: Converts conversation text into typed semantic concepts using a local model (Llama 3.2)
ValueTagger: Assigns importance scores across four dimensions — novelty, emotional valence, task relevance, and repetition frequency
WorkingMemory: A limited buffer of 7 items (per Miller's Law) holding recent experiences in a fast-access store
LongTermMemory: A persistent semantic graph in NetworkX/SQLite
SleepCycle: An offline orchestrator that triggers on entropy spikes, conflicts, or elapsed time
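To make the ValueTagger and WorkingMemory modules concrete, the sketch below scores a concept on the four dimensions and holds at most seven items. The equal default weights, the assumption that each dimension lies in [0, 1], and the first-in-first-out eviction are all guesses made for illustration; the source does not say how SCM combines the dimensions or handles overflow.

```python
from collections import deque

VALUE_DIMENSIONS = ("novelty", "emotional_valence",
                    "task_relevance", "repetition_frequency")


def importance(scores, weights=None):
    """Weighted mean of the four dimension scores."""
    weights = weights or {dim: 1.0 for dim in VALUE_DIMENSIONS}
    total = sum(weights[dim] for dim in VALUE_DIMENSIONS)
    return sum(weights[dim] * scores[dim] for dim in VALUE_DIMENSIONS) / total


class WorkingMemory:
    def __init__(self, capacity=7):
        # deque(maxlen=...) silently drops the oldest item when full.
        self.items = deque(maxlen=capacity)

    def add(self, concept, scores):
        self.items.append((concept, importance(scores)))

    def most_important(self):
        return max(self.items, key=lambda item: item[1])[0]


wm = WorkingMemory()
for i in range(9):
    level = i / 10
    wm.add(f"c{i}", {dim: level for dim in VALUE_DIMENSIONS})
held = [concept for concept, _ in wm.items]
```

After nine additions only the seven most recent concepts remain, and the importance scores travel with them for the sleep cycle to use later.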

The sleep phase has two sub-stages mirroring neuroscience: NREM (slow-wave) consolidation replays working memory episodes, strengthens co-occurring concepts via Hebbian plasticity, and applies 20% synaptic downscaling per cycle. REM consolidation performs random walks through the graph to generate novel associations. Intentional forgetting then prunes low-value-score nodes.
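The NREM and forgetting steps can be sketched as two small functions. The order of operations (strengthen replayed edges, then scale everything down by 20%), the boost size, and the pruning threshold are assumptions; the REM random-walk stage is left out to keep the example deterministic.

```python
def nrem_cycle(weights, replayed_pairs, boost=1.0, downscale=0.20):
    """Strengthen replayed edges, then scale every edge down by downscale."""
    updated = dict(weights)
    for pair in replayed_pairs:
        updated[pair] = updated.get(pair, 0.0) + boost
    return {pair: w * (1.0 - downscale) for pair, w in updated.items()}


def forget(values, edges, min_value):
    """Drop concepts scored below min_value, plus any edge touching them."""
    kept = {concept for concept, value in values.items() if value >= min_value}
    kept_edges = {(a, b): w for (a, b), w in edges.items()
                  if a in kept and b in kept}
    return kept, kept_edges


edges = {("deadline", "stress"): 1.0, ("weather", "tuesday"): 1.0}
edges = nrem_cycle(edges, replayed_pairs=[("deadline", "stress")])

# Hypothetical value scores, as a ValueTagger-like component might assign.
values = {"deadline": 0.9, "stress": 0.8, "weather": 0.1, "tuesday": 0.2}
kept_concepts, kept_edges = forget(values, edges, min_value=0.5)
```

Downscaling every edge means that associations which are never replayed fade over successive cycles, while replayed ones stay ahead.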

Reported benchmark results: 22/22 fact recall across ten-turn conversations; roughly 90% noise reduction (45 of 50 noise concepts removed while preserving all signal concepts); sub-millisecond retrieval at 360 concepts; and 24 concepts maintained vs. 72 for comparison systems at identical recall accuracy.

Anthropic shipped a production version of this concept — called "Dreaming" — on May 6, 2026 for Claude Managed Agents. It runs asynchronously between sessions: reviewing transcripts, extracting patterns, merging duplicates, and replacing stale entries. The system is explicitly modeled on hippocampal consolidation, treating off-peak periods as the agent's "sleep time" for memory maintenance.

E-mem: Multi-Agent Episodic Reconstruction

E-mem (January 2026) tackles a different variant of the problem: as agents accumulate long interaction histories, how do you reconstruct a coherent episode for a current task without loading the full history?

The approach is to use a collaborative multi-agent sub-system specifically for memory reconstruction. Specialized agents segment the history into coherent episodes based on task boundaries and contextual shifts, extract salient information from raw interaction sequences, and reconstruct the relevant context for the current query by identifying relationships between past experiences and present needs.
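The segmentation and reconstruction steps can be illustrated with a deliberately simplified, single-process sketch. It assumes each turn already carries a task label and treats a change of label as an episode boundary; E-mem's agents would infer boundaries and relevance themselves, so the function names and data layout here are hypothetical.

```python
from itertools import groupby


def segment_episodes(turns):
    """Split the history wherever the task label changes."""
    return [
        {"task": task, "turns": [turn["text"] for turn in group]}
        for task, group in groupby(turns, key=lambda turn: turn["task"])
    ]


def reconstruct(episodes, query_task):
    """Gather only the episodes relevant to the current task, in order."""
    relevant = [ep for ep in episodes if ep["task"] == query_task]
    return [text for ep in relevant for text in ep["turns"]]


history = [
    {"task": "billing", "text": "Invoice 12 is wrong"},
    {"task": "billing", "text": "Refund issued"},
    {"task": "deploy", "text": "Staging deploy failed"},
    {"task": "billing", "text": "Invoice 13 looks right"},
]
episodes = segment_episodes(history)
billing_context = reconstruct(episodes, "billing")
```

Only the billing turns reach the context for a billing question, even though a deploy episode sits between them in the raw history.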

This is distinct from retrieval — it is active reconstruction, analogous to how humans don't retrieve memories verbatim but reconstruct them from fragments. The multi-agent design enables parallel processing of different memory facets. Results on HotpotQA show measurable gains in multi-hop reasoning when episodic reconstruction is applied.

Production Gaps Identified by Mem0 State-of-the-Field Report

The 2026 state-of-AI-agent-memory report from Mem0 identifies five critical unsolved problems despite benchmark progress:

Temporal abstraction at scale: Performance drops ~25% when scaling from 1M to 10M tokens of accumulated memory. Current architectures have no mechanism to abstract events at multiple temporal granularities — individual turns, daily summaries, weekly themes — in a unified, queryable form.

Cross-session identity: User identity breaks when users switch devices, use anonymous sessions, or interact via different channels. Memory systems built around stable user_id assumptions fail silently in these cases. Zylos's multi-channel architecture (Telegram, Lark, web console all sharing one memory layer) faces exactly this challenge.

Memory staleness: High-relevance memories can become confidently wrong when circumstances change. A user who "prefers late-night meetings" may now have young children. Systems that store facts without versioning or decay mechanisms will confidently contradict the user's current reality.

Cross-session structural modeling: Current systems replace facts rather than modeling information evolution. The difference between "user changed their mind" and "user's situation changed" is episodically significant but semantically invisible.

Privacy architecture: Consent, right-to-be-forgotten, and retention policies remain application-layer decisions with no industry standards. As episodic memory systems become richer and more personal, this gap becomes a product risk.
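The staleness and structural-modeling gaps above suggest keeping a version history per fact instead of overwriting it. The sketch below is one possible shape, not a design taken from the Mem0 report: the `change_kind` labels, the exponential decay, the 180-day half-life, and the dates are all assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class FactVersion:
    value: str
    recorded_on: date
    change_kind: str  # "initial", "changed_mind", or "situation_changed"
    reason: Optional[str] = None


class VersionedFact:
    def __init__(self, key, half_life_days=180):
        self.key = key
        self.half_life_days = half_life_days
        self.history = []  # append-only, oldest first

    def update(self, value, recorded_on, change_kind, reason=None):
        self.history.append(FactVersion(value, recorded_on, change_kind, reason))

    def current(self):
        return self.history[-1]

    def confidence(self, today):
        """Confidence in the current value, halving every half_life_days."""
        age_days = (today - self.current().recorded_on).days
        return 0.5 ** (age_days / self.half_life_days)


meeting_pref = VersionedFact("preferred_meeting_time")
meeting_pref.update("late-night", date(2024, 1, 1), "initial")
meeting_pref.update("mornings", date(2025, 1, 1), "situation_changed",
                    reason="young children at home")
confidence_now = meeting_pref.confidence(date(2025, 6, 30))
```

Keeping `change_kind` and `reason` alongside each version preserves the distinction between a changed mind and a changed situation, and the decayed confidence gives the agent a cue to re-confirm old facts.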

Architectural Patterns for Production Episodic Memory

Synthesizing across the systems reviewed, five design patterns emerge for production-grade episodic memory:

1. Separate episodic and semantic stores. Raw event records (episodic) and extracted, consolidated facts (semantic) should be distinct data structures with different update patterns. Episodic memory is append-only and temporally ordered; semantic memory is mutable and fact-centric. Conflating them leads to stale facts masquerading as episodic events.

2. Asynchronous consolidation. Background consolidation (MAGMA's slow path, SCM's sleep cycle, Anthropic's Dreaming) is the operationally correct pattern: never block agent interactions on expensive LLM-based memory structuring. Consolidation is idempotent and can be retried; interaction latency is not recoverable.

3. Intent-aware retrieval. Routing query intent before retrieval — distinguishing "why" questions (causal traversal), "when" questions (temporal traversal), and "what" questions (semantic retrieval) — consistently outperforms pure vector similarity. MAGMA's ablation confirms intent routing is the largest single contributor to accuracy.

4. Value-scored forgetting. Not all memories deserve equal retention. Scoring memories on dimensions such as novelty, task relevance, and recency, then pruning low-scoring nodes, is more effective than either unlimited growth (memory explosion) or sliding-window eviction (losing important but infrequent events). SCM's four-dimensional ValueTagger (novelty, emotional valence, task relevance, repetition frequency) is the clearest implementation of this.

5. Narrative synthesis over raw retrieval. Injecting ordered, provenance-annotated excerpts into context, rather than a raw bag of retrieved chunks, preserves narrative structure and gives the model explicit ordering and sources to ground its answers, which helps reduce hallucination on episodic questions. MAGMA's Stage 4 narrative synthesis is the reference implementation.
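Pattern 5 can be sketched in a few lines. The example orders retrieved events by timestamp, which is a valid topological order whenever every edge points forward in time, and prefixes each with its provenance. The field names, the output format, and the dates are assumptions, not MAGMA's actual output.

```python
def synthesize_narrative(events):
    """Order retrieved events chronologically and annotate provenance."""
    ordered = sorted(events, key=lambda event: event["timestamp"])
    return "\n".join(
        f"[{event['timestamp']} | {event['source']}] {event['content']}"
        for event in ordered
    )


# Retrieval order is arbitrary; ISO dates sort correctly as strings.
retrieved = [
    {"timestamp": "2025-04-02", "source": "session-19",
     "content": "User asked for automated weekly planning"},
    {"timestamp": "2025-03-11", "source": "session-14",
     "content": "User reported recurring deadline stress"},
]
narrative = synthesize_narrative(retrieved)
print(narrative)
```

The resulting block reads as a short timeline, so the model sees the March event before the April one regardless of how the retriever ranked them.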

Implications for Zylos

Zylos already has a tiered memory architecture (identity, state, references, sessions, archive) that maps closely to the episodic/semantic distinction. The current implementation stores session events in sessions/current.md and compacts them via the memory sync process — which is functionally a manual sleep-consolidation cycle.

The primary gap is in temporal structure: current memory does not model event sequences with causal links, so questions such as "why did we decide X?" or "when did Howard last ask about Y?" cannot be answered without scanning raw session files. Adopting a lightweight temporal graph over session events, even a simple ordered event log with typed tags (decision, preference-change, task-completion, correction), would enable the class of multi-hop episodic queries that the current memory architecture cannot answer.
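A minimal sketch of such a typed event log is shown below. It is not Zylos's implementation: the `EventLog` class, the `caused_by` back-pointer, and the example entries are hypothetical, and only the four tag names come from the text.

```python
from dataclasses import dataclass
from typing import Optional

EVENT_TYPES = {"decision", "preference-change", "task-completion", "correction"}


@dataclass(frozen=True)
class LoggedEvent:
    seq: int                        # position in the ordered log
    date: str                       # ISO date
    kind: str                       # one of EVENT_TYPES
    topic: str
    note: str
    caused_by: Optional[int] = None  # seq of the event that led to this one


class EventLog:
    def __init__(self):
        self.events = []

    def append(self, date, kind, topic, note, caused_by=None):
        if kind not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {kind}")
        event = LoggedEvent(len(self.events), date, kind, topic, note, caused_by)
        self.events.append(event)
        return event

    def last(self, topic, kind=None):
        """Answer 'when did we last ...?' for a topic and optional type."""
        for event in reversed(self.events):
            if event.topic == topic and kind in (None, event.kind):
                return event
        return None

    def why(self, event):
        """Follow caused_by links back to earlier events."""
        chain = []
        while event.caused_by is not None:
            event = self.events[event.caused_by]
            chain.append(event)
        return chain


log = EventLog()
broke = log.append("2025-03-01", "correction", "deploys", "Manual deploy broke staging")
log.append("2025-03-02", "decision", "deploys", "Adopt automated deploy checks",
           caused_by=broke.seq)
log.append("2025-03-09", "task-completion", "deploys", "Checks wired into CI")

decision = log.last("deploys", kind="decision")
reasons = [event.note for event in log.why(decision)]
```

Even this flat log answers both question types: `last` handles "when", and `why` walks the causal back-pointers.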

The cross-session identity problem is directly relevant given Zylos's multi-channel design. A user is the same person across Telegram, Lark, and web console, but the current memory model stores interactions without channel-normalized identity anchoring. As the user count and channel count grow, cross-channel identity resolution will become a prerequisite for coherent episodic recall.
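Channel-normalized identity anchoring can be sketched as a mapping from (channel, handle) pairs to a canonical user id. The `IdentityResolver` class and the handles below are hypothetical; how two accounts are proven to belong to the same person (verified login, explicit linking, and so on) is outside the sketch.

```python
class IdentityResolver:
    def __init__(self):
        self._canonical = {}  # (channel, handle) -> canonical user id

    def resolve(self, channel, handle):
        """Return the canonical id, creating a fresh one on first sight."""
        return self._canonical.setdefault((channel, handle), f"{channel}:{handle}")

    def link(self, first, second):
        """Record that two (channel, handle) pairs are the same person."""
        target = self.resolve(*first)
        old = self.resolve(*second)
        # Re-point every identity that shared the old canonical id.
        for key, canonical_id in self._canonical.items():
            if canonical_id == old:
                self._canonical[key] = target


resolver = IdentityResolver()
resolver.link(("telegram", "@howard"), ("lark", "u_123"))

telegram_id = resolver.resolve("telegram", "@howard")
lark_id = resolver.resolve("lark", "u_123")
web_id = resolver.resolve("web", "anon-42")
```

Memory writes keyed on the canonical id then land in one episodic history, while unlinked sessions such as the anonymous web visitor stay separate until they are linked.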

Key Takeaways

Episodic memory (sequenced events with context) is architecturally distinct from semantic memory (consolidated facts) and required for temporal reasoning, narrative continuity, and adaptive behavior
MAGMA's four-graph architecture and intent-aware traversal define the current state of the art, scoring 0.700 on LoCoMo vs. 0.481 for full-context approaches, and using 95% fewer tokens on LongMemEval
Biologically-inspired sleep consolidation (SCM, Anthropic Dreaming) is an emerging production pattern for compacting raw experience into durable memory without blocking agent responsiveness
The five unsolved production gaps — temporal abstraction at scale, cross-session identity, memory staleness, structural modeling of fact evolution, and privacy architecture — define the research and engineering agenda for 2026–2027
Zylos's next memory architecture iteration should prioritize: (1) typed event tagging in session logs, (2) lightweight temporal ordering across sessions, and (3) channel-normalized identity anchoring for cross-channel episodic coherence