AI Agent Context Compression: Strategies for Long-Running Sessions
Executive Summary
As AI agents take on longer, more complex tasks, the unbounded growth of conversation history becomes a fundamental engineering problem. Context window limits are a hard ceiling; token costs compound with every turn; and degraded context quality — "context drift" — silently undermines agent reasoning before the limit is even reached. In 2025–2026, the field has converged on a set of concrete compression techniques: anchored iterative summarization, failure-driven guideline optimization (ACON), and provider-native compaction APIs. This research surveys the state of the art, evaluates tradeoffs, and draws implications for long-running AI agent systems.
Key Findings
- Context drift kills agents before context limits do. Nearly 65% of enterprise AI failures in 2025 were attributed to context drift or memory loss during multi-step reasoning — not raw context exhaustion.
- Anchored iterative summarization consistently outperforms full-reconstruction. Factory's evaluation across 36,000 real engineering session messages showed that merging new summaries into a persistent state (rather than regenerating from scratch) produces higher accuracy, completeness, and continuity scores.
- ACON reduces memory usage 26–54% while preserving 95%+ task accuracy. The failure-driven guideline optimization approach — where compression prompts are iteratively refined by analyzing cases where compressed context caused failures — is gradient-free and compatible with closed-source models.
- Anthropic's compaction API (compact-2026-01-12) provides production-ready automatic compaction that works across Claude API, AWS Bedrock, Google Vertex AI, and Microsoft Foundry with Zero Data Retention support.
- The industry is shifting from expanding context windows to smarter context management. 2026 trends suggest context window size is plateauing as focus shifts to inference-time scaling, hybrid compression + caching, and memory-augmented architectures.
Technical Details
The Context Accumulation Problem
In a long-running agent session, context grows from three sources:
- Conversation turns — the full history of user messages and model responses
- Tool outputs — often verbose JSON or document content returned by tool calls
- Observation history — environment state snapshots in agentic frameworks (e.g., browser DOM, file listings, code diffs)
At 95% per-step reliability over a 20-step workflow, the combined success rate drops to just 36%. A 2% misalignment introduced early in a chain can compound into a 40% failure rate by the end. This means context quality — not just quantity — is the primary reliability lever.
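The compounding claim follows from multiplying independent per-step success probabilities; a quick check:

```typescript
// Probability that an n-step workflow succeeds end to end when each
// step independently succeeds with probability p.
function chainReliability(p: number, steps: number): number {
  return Math.pow(p, steps);
}

const r = chainReliability(0.95, 20); // ≈ 0.36
```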
Compression Approaches
1. Sliding Window / Full Replacement
The simplest approach: drop messages older than N turns. Fast, but loses continuity. Best used only for short sessions with no long-term dependencies.
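A minimal sketch of the window policy (the types and names here are illustrative, not from any cited framework):

```typescript
interface Message {
  role: "user" | "assistant" | "tool";
  content: string;
}

// Keep only the most recent `windowSize` messages; everything older is dropped.
function slidingWindow(history: Message[], windowSize: number): Message[] {
  if (windowSize <= 0) return [];
  return history.slice(-windowSize);
}
```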
2. Rolling LLM Summarization (Full Reconstruction)
When the context exceeds a threshold, summarize the entire history from scratch. Produces a clean, coherent summary, but has two failure modes:
- Details drift or disappear across multiple compression cycles
- Expensive: must process the full history each time
3. Anchored Iterative Summarization (Factory's approach)
The key insight: don't regenerate the full summary — extend it. When compression is triggered:
- Identify only the newly-dropped span (messages being evicted)
- Summarize that span alone
- Merge the new summary into the persisted anchor state
Factory structures their anchor around four fields: intent, changes made, decisions taken, and next steps. Their evaluation showed this approach scores highest on accuracy (4.04 vs. Anthropic's 3.74 and OpenAI's 3.43) for preserving technical details like file paths and error messages across compression cycles.
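A sketch of the merge step, assuming an anchor shaped around Factory's four fields. The merge policy shown (intent stays fixed, changes and decisions accumulate, next steps are replaced wholesale) is an illustrative assumption, and summarizing the evicted span would be an LLM call in practice:

```typescript
// Persistent anchor state mirroring Factory's four fields.
interface AnchorState {
  intent: string;
  changes: string[];
  decisions: string[];
  nextSteps: string[];
}

// Summary of the newly-evicted span (produced by an LLM in a real system).
interface SpanSummary {
  changes: string[];
  decisions: string[];
  nextSteps: string[];
}

// Merge the span summary into the anchor instead of regenerating it:
// intent is stable, changes/decisions accumulate, next steps are replaced.
function mergeIntoAnchor(anchor: AnchorState, span: SpanSummary): AnchorState {
  return {
    intent: anchor.intent,
    changes: [...anchor.changes, ...span.changes],
    decisions: [...anchor.decisions, ...span.decisions],
    nextSteps: span.nextSteps,
  };
}
```

Because only the evicted span is summarized on each cycle, earlier summary text is never re-generated — which is what prevents the detail drift seen in full-reconstruction approaches.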
4. ACON: Failure-Driven Guideline Optimization
ACON (Agent Context Optimization) is a research framework from arXiv (Oct 2025) that treats compression as an optimization problem:
- Paired trajectory analysis: Find cases where full context succeeded but compressed context failed
- Failure analysis: Use a capable LLM to identify what information the compression lost that caused the failure
- Guideline update: Revise the compression prompt to preserve that class of information in future runs
- Distillation: Once optimized, distill the compressor into a smaller model (preserving 95%+ accuracy at lower cost)
Results on AppWorld, OfficeBench, and Multi-objective QA benchmarks show 26–54% reduction in peak token usage. Critically, ACON is gradient-free — no fine-tuning required, compatible with any API-accessible model.
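The outer refinement loop can be sketched as follows; the two analysis steps are stand-ins for LLM calls, and all names here are hypothetical rather than ACON's actual API:

```typescript
// A case where the agent succeeded with full context but failed with
// compressed context — the signal ACON optimizes against.
interface PairedFailure {
  task: string;
  fullContext: string;
  compressedContext: string;
}

// One iteration of failure-driven guideline refinement.
// `diagnoseLoss` and `reviseGuideline` stand in for capable-LLM calls.
function refineGuideline(
  guideline: string,
  failures: PairedFailure[],
  diagnoseLoss: (c: PairedFailure) => string,
  reviseGuideline: (g: string, losses: string[]) => string,
): string {
  if (failures.length === 0) return guideline; // nothing to learn from
  const losses = failures.map(diagnoseLoss);   // what information was lost?
  return reviseGuideline(guideline, losses);   // preserve that class next time
}
```

The loop is gradient-free by construction: it only edits the compression prompt, so it works with any API-accessible model, and the optimized compressor can later be distilled into a smaller model.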
5. Provider-Native Compaction (Anthropic API)
Anthropic's compact-2026-01-12 beta compaction feature automates the trigger-and-summarize loop:
```typescript
// Enable automatic compaction
const response = await client.messages.create({
  model: "claude-opus-4-6",
  max_tokens: 8096,
  betas: ["compact-2026-01-12"],
  compaction_config: {
    trigger_token_count: 50000,
  },
  messages: conversationHistory,
});
```
When the input token count exceeds the trigger threshold, the API generates a compaction block, inserts it into the conversation, and continues — transparent to the application layer. The compaction block is available for inspection and replay.
6. Embedding-Based Compression
Store historical turns as dense embeddings rather than full text. Reconstruct only the semantically relevant segments for each new turn. Achieves 80–90% token reduction for stored history, at the cost of retrieval latency and potential precision loss for verbatim details.
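A minimal retrieval sketch, assuming embeddings are already computed and stored; a real system would use an embedding model and a vector index rather than the linear scan shown here:

```typescript
interface StoredTurn {
  text: string;
  embedding: number[];
}

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return only the k stored turns most relevant to the current query,
// instead of replaying the full history.
function retrieveRelevant(store: StoredTurn[], query: number[], k: number): StoredTurn[] {
  return [...store]
    .sort((x, y) => cosine(y.embedding, query) - cosine(x.embedding, query))
    .slice(0, k);
}
```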
Compression Ratio Targets (Production Guidance)
| Content Type | Recommended Ratio | Notes |
|---|---|---|
| Conversation history (old turns) | 3:1 to 5:1 | Prioritize decisions and outcomes |
| Tool outputs / observations | 10:1 to 20:1 | Usually verbose, keep only conclusions |
| Recent messages (last 5–7 turns) | No compression | Keep in full — recency matters |
| System prompt | No compression | Anchor behavior; never compress |
Trigger compaction when context utilization exceeds 70% of the available budget. Research on context rot (Chroma) demonstrates that performance degradation accelerates beyond 30,000 tokens, even in models with much larger windows.
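The 70% trigger reduces to a one-line check; the token accounting and default threshold are the only assumptions here:

```typescript
// Fire compaction proactively once usage crosses a fraction of the budget,
// rather than reacting when the hard context limit is hit.
function shouldCompact(usedTokens: number, budgetTokens: number, threshold = 0.7): boolean {
  return usedTokens / budgetTokens > threshold;
}
```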
Context Drift: The Silent Failure Mode
Context drift is distinct from context exhaustion. It occurs when the model's reasoning diverges from the original task intent because:
- Older task context is gradually de-prioritized by attention mechanisms
- Compressed summaries introduce subtle rewording that shifts framing
- Tool outputs from early steps are overwritten by later results
Symptoms in production agents:
- Agents re-do work already completed
- Goal statements shift in wording across turns
- Technical details (variable names, file paths, error codes) become incorrect
- Instructions from the system prompt are "forgotten"
Detection: Distributed tracing with full conversation trajectory visualization can identify the exact turn where drift begins. The QSAF framework (2025) proposes seven runtime controls including starvation detection, fallback routing, and memory integrity enforcement.
Cognitive Degradation Resilience (CDR)
The Cloud Security Alliance formalized "Cognitive Degradation Resilience" as a distinct property in late 2025 — separate from traditional reliability. A CDR-compliant agent system must:
- Monitor planner recursion depth, context density, and memory saturation in real time
- Detect early-stage drift before it compounds (2% misalignment → 40% failure)
- Mitigate through fallback routing, episodic consolidation, and adaptive behavioral anchoring
- Recover to a known-good state without requiring full session restart
Implications for AI Agents
For Zylos-class agents (long-running, multi-turn, multi-channel):
- Implement trigger-based compaction rather than reactive truncation. The cost of summarizing proactively is far lower than the cost of recovering from a context-driven failure.
- The anchored iterative pattern is directly applicable: maintain a structured "session state" document (intent, decisions, pending work) and update it incrementally rather than re-summarizing from scratch.
- Tool output verbosity is the primary token budget killer. Filter and truncate tool responses at ingestion time, not at compression time — keep only what the agent needs to reason about next.
- Monitor for drift signals in multi-step tasks: re-reads of already-processed content, re-statements of decisions, goal wording changes. These are leading indicators of context degradation.
- For very long sessions (hours, hundreds of turns), consider ACON-style failure analysis to tune compression prompts for the specific task domain — a one-time investment that pays off across all future sessions of that type.
Architectural pattern for production agents:
Incoming message
↓
[Context budget check]
< 70% → append normally
> 70% → [identify evictable span]
↓
[summarize span]
↓
[merge into anchor state]
↓
[append anchor + recent messages]
↓
LLM call
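The flow above can be sketched end to end; `summarize` stands in for the LLM summarization-and-merge call, and the 70% trigger and keep-recent count mirror the guidance earlier in this document (all names are illustrative):

```typescript
interface Msg {
  role: string;
  content: string;
  tokens: number;
}

interface Anchor {
  summary: string;
}

// One pass of the pipeline: budget check → identify evictable span →
// summarize → merge into anchor → keep anchor + recent messages.
function compactIfNeeded(
  history: Msg[],
  anchor: Anchor,
  budget: number,
  keepRecent: number,
  summarize: (evicted: Msg[], prior: Anchor) => Anchor, // LLM call in practice
): { history: Msg[]; anchor: Anchor } {
  const used = history.reduce((n, m) => n + m.tokens, 0);
  if (used <= budget * 0.7) return { history, anchor }; // < 70% → append normally
  const evicted = history.slice(0, -keepRecent);        // identify evictable span
  const recent = history.slice(-keepRecent);            // recent turns stay verbatim
  return { history: recent, anchor: summarize(evicted, anchor) };
}
```

The LLM call then receives the anchor (rendered into the prompt) followed by the retained recent messages.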
Sources
- Evaluating Context Compression for AI Agents — Factory.ai
- Compressing Context — Factory.ai
- Evaluating context compression in AI agents — Tessl.io
- ACON: Optimizing Context Compression for Long-horizon LLM Agents — arXiv
- Context Window Management Strategies — Maxim AI
- Context Engineering for AI Agents: Token Economics — Maxim AI
- Compaction — Anthropic Claude API Docs
- Context Compaction API — DEV Community
- Cognitive Degradation Resilience Framework — Cloud Security Alliance
- How Context Drift Impacts Conversational Coherence — Maxim AI
- Agent Drift: Behavioral Degradation in Multi-Agent LLM Systems — arXiv
- The Fundamentals of Context Management and Compaction in LLMs — Medium
- Efficient Context Management for LLM-Powered Agents — JetBrains Research
- AI Agents Need Memory Control Over More Context — arXiv
- Why AI Agent Pilots Fail in Production — Composio

