AI Agent Context Compression: Strategies for Long-Running Sessions
Executive Summary
As AI agents take on longer, more complex tasks, the unbounded growth of conversation history becomes a fundamental engineering problem. Context window limits are a hard ceiling; token costs compound with every turn; and degraded context quality — "context drift" — silently undermines agent reasoning before the limit is even reached. In 2025–2026, the field has converged on a set of concrete compression techniques: anchored iterative summarization, failure-driven guideline optimization (ACON), and provider-native compaction APIs. This research surveys the state of the art, evaluates tradeoffs, and draws implications for long-running AI agent systems.
Key Findings
- Context drift kills agents before context limits do. Nearly 65% of enterprise AI failures in 2025 were attributed to context drift or memory loss during multi-step reasoning — not raw context exhaustion.
- Anchored iterative summarization consistently outperforms full-reconstruction. Factory's evaluation across 36,000 real engineering session messages showed that merging new summaries into a persistent state (rather than regenerating from scratch) produces higher accuracy, completeness, and continuity scores.
- ACON reduces memory usage 26–54% while preserving 95%+ task accuracy. The failure-driven guideline optimization approach — where compression prompts are iteratively refined by analyzing cases where compressed context caused failures — is gradient-free and compatible with closed-source models.
- Anthropic's compaction API (compact-2026-01-12) provides production-ready automatic compaction that works across Claude API, AWS Bedrock, Google Vertex AI, and Microsoft Foundry with Zero Data Retention support.
- The industry is shifting from expanding context windows to smarter context management. 2026 trends suggest context window size is plateauing as focus shifts to inference-time scaling, hybrid compression + caching, and memory-augmented architectures.
Technical Details
The Context Accumulation Problem
In a long-running agent session, context grows from three sources:
- Conversation turns — the full history of user messages and model responses
- Tool outputs — often verbose JSON or document content returned by tool calls
- Observation history — environment state snapshots in agentic frameworks (e.g., browser DOM, file listings, code diffs)
At 95% per-step reliability over a 20-step workflow, the combined success rate drops to just 36%. A 2% misalignment introduced early in a chain can compound into a 40% failure rate by the end. This means context quality — not just quantity — is the primary reliability lever.
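The compounding claim follows from multiplying independent per-step success probabilities; a quick check:

```typescript
// Probability that an n-step workflow succeeds end to end when each
// step independently succeeds with probability p.
function chainReliability(p: number, steps: number): number {
  return Math.pow(p, steps);
}

const r = chainReliability(0.95, 20); // ≈ 0.36
```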
Compression Approaches
1. Sliding Window / Full Replacement
The simplest approach: drop messages older than N turns. Fast, but loses continuity. Best used only for short sessions with no long-term dependencies.
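A minimal sketch of the window policy (the types and names here are illustrative, not from any cited framework):

```typescript
interface Message {
  role: "user" | "assistant" | "tool";
  content: string;
}

// Keep only the most recent `windowSize` messages; everything older is dropped.
function slidingWindow(history: Message[], windowSize: number): Message[] {
  if (windowSize <= 0) return [];
  return history.slice(-windowSize);
}
```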
2. Rolling LLM Summarization (Full Reconstruction)
When the context exceeds a threshold, summarize the entire history from scratch. Produces a clean, coherent summary, but has two failure modes:
- Details drift or disappear across multiple compression cycles
- Expensive: must process the full history each time
3. Anchored Iterative Summarization (Factory's approach)
The key insight: don't regenerate the full summary — extend it. When compression is triggered:
- Identify only the newly-dropped span (messages being evicted)
- Summarize that span alone
- Merge the new summary into the persisted anchor state
Factory structures their anchor around four fields: intent, changes made, decisions taken, and next steps. Their evaluation showed this approach scores highest on accuracy (4.04 vs. Anthropic's 3.74 and OpenAI's 3.43) for preserving technical details like file paths and error messages across compression cycles.
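A sketch of the merge step, assuming an anchor shaped around Factory's four fields. The merge policy shown (intent stays fixed, changes and decisions accumulate, next steps are replaced wholesale) is an illustrative assumption, and summarizing the evicted span would be an LLM call in practice:

```typescript
// Persistent anchor state mirroring Factory's four fields.
interface AnchorState {
  intent: string;
  changes: string[];
  decisions: string[];
  nextSteps: string[];
}

// Summary of the newly-evicted span (produced by an LLM in a real system).
interface SpanSummary {
  changes: string[];
  decisions: string[];
  nextSteps: string[];
}

// Merge the span summary into the anchor instead of regenerating it:
// intent is stable, changes/decisions accumulate, next steps are replaced.
function mergeIntoAnchor(anchor: AnchorState, span: SpanSummary): AnchorState {
  return {
    intent: anchor.intent,
    changes: [...anchor.changes, ...span.changes],
    decisions: [...anchor.decisions, ...span.decisions],
    nextSteps: span.nextSteps,
  };
}
```

Because only the evicted span is summarized on each cycle, earlier summary text is never re-generated — which is what prevents the detail drift seen in full-reconstruction approaches.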
4. ACON: Failure-Driven Guideline Optimization
ACON (Agent Context Optimization) is a research framework from arXiv (Oct 2025) that treats compression as an optimization problem:
- Paired trajectory analysis: Find cases where full context succeeded but compressed context failed
- Failure analysis: Use a capable LLM to identify what information the compression lost that caused the failure
- Guideline update: Revise the compression prompt to preserve that class of information in future runs
- Distillation: Once optimized, distill the compressor into a smaller model (preserving 95%+ accuracy at lower cost)
Results on AppWorld, OfficeBench, and Multi-objective QA benchmarks show 26–54% reduction in peak token usage. Critically, ACON is gradient-free — no fine-tuning required, compatible with any API-accessible model.
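The outer refinement loop can be sketched as follows; the two analysis steps are stand-ins for LLM calls, and all names here are hypothetical rather than ACON's actual API:

```typescript
// A case where the agent succeeded with full context but failed with
// compressed context — the signal ACON optimizes against.
interface PairedFailure {
  task: string;
  fullContext: string;
  compressedContext: string;
}

// One iteration of failure-driven guideline refinement.
// `diagnoseLoss` and `reviseGuideline` stand in for capable-LLM calls.
function refineGuideline(
  guideline: string,
  failures: PairedFailure[],
  diagnoseLoss: (c: PairedFailure) => string,
  reviseGuideline: (g: string, losses: string[]) => string,
): string {
  if (failures.length === 0) return guideline; // nothing to learn from
  const losses = failures.map(diagnoseLoss);   // what information was lost?
  return reviseGuideline(guideline, losses);   // preserve that class next time
}
```

The loop is gradient-free by construction: it only edits the compression prompt, so it works with any API-accessible model, and the optimized compressor can later be distilled into a smaller model.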
5. Provider-Native Compaction (Anthropic API)
Anthropic's compact-2026-01-12 beta compaction feature automates the trigger-and-summarize loop:
```typescript
// Enable automatic compaction
const response = await client.messages.create({
  model: "claude-opus-4-6",
  max_tokens: 8096,
  betas: ["compact-2026-01-12"],
  compaction_config: {
    trigger_token_count: 50000,
  },
  messages: conversationHistory,
});
```
When the input token count exceeds the trigger threshold, the API generates a compaction block, inserts it into the conversation, and continues — transparent to the application layer. The compaction block is available for inspection and replay.
6. Embedding-Based Compression
Store historical turns as dense embeddings rather than full text. Reconstruct only the semantically relevant segments for each new turn. Achieves 80–90% token reduction for stored history, at the cost of retrieval latency and potential precision loss for verbatim details.
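A minimal retrieval sketch, assuming embeddings are already computed and stored; a real system would use an embedding model and a vector index rather than the linear scan shown here:

```typescript
interface StoredTurn {
  text: string;
  embedding: number[];
}

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return only the k stored turns most relevant to the current query,
// instead of replaying the full history.
function retrieveRelevant(store: StoredTurn[], query: number[], k: number): StoredTurn[] {
  return [...store]
    .sort((x, y) => cosine(y.embedding, query) - cosine(x.embedding, query))
    .slice(0, k);
}
```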
Compression Ratio Targets (Production Guidance)
| Content Type | Recommended Ratio | Notes |
|---|---|---|
| Conversation history (old turns) | 3:1 to 5:1 | Prioritize decisions and outcomes |
| Tool outputs / observations | 10:1 to 20:1 | Usually verbose, keep only conclusions |
| Recent messages (last 5–7 turns) | No compression | Keep in full — recency matters |
| System prompt | No compression | Anchor behavior; never compress |
Trigger compaction when context utilization exceeds 70% of the available budget. Research on context rot (Chroma) demonstrates that performance degradation accelerates beyond 30,000 tokens, even in models with much larger windows.
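The 70% trigger reduces to a one-line check; the token accounting and default threshold are the only assumptions here:

```typescript
// Fire compaction proactively once usage crosses a fraction of the budget,
// rather than reacting when the hard context limit is hit.
function shouldCompact(usedTokens: number, budgetTokens: number, threshold = 0.7): boolean {
  return usedTokens / budgetTokens > threshold;
}
```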
Context Drift: The Silent Failure Mode
Context drift is distinct from context exhaustion. It occurs when the model's reasoning diverges from the original task intent because:
- Older task context is gradually de-prioritized by attention mechanisms
- Compressed summaries introduce subtle rewording that shifts framing
- Tool outputs from early steps are overwritten by later results
Symptoms in production agents:
- Agents re-do work already completed
- Goal statements shift in wording across turns
- Technical details (variable names, file paths, error codes) become incorrect
- Instructions from the system prompt are "forgotten"
Detection: Distributed tracing with full conversation trajectory visualization can identify the exact turn where drift begins. The QSAF framework (2025) proposes seven runtime controls including starvation detection, fallback routing, and memory integrity enforcement.
Cognitive Degradation Resilience (CDR)
The Cloud Security Alliance formalized "Cognitive Degradation Resilience" as a distinct property in late 2025 — separate from traditional reliability. A CDR-compliant agent system must:
- Monitor planner recursion depth, context density, and memory saturation in real time
- Detect early-stage drift before it compounds (2% misalignment → 40% failure)
- Mitigate through fallback routing, episodic consolidation, and adaptive behavioral anchoring
- Recover to a known-good state without requiring full session restart
Implications for AI Agents
For Zylos-class agents (long-running, multi-turn, multi-channel):
- Implement trigger-based compaction rather than reactive truncation. The cost of summarizing proactively is far lower than the cost of recovering from a context-driven failure.
- The anchored iterative pattern is directly applicable: maintain a structured "session state" document (intent, decisions, pending work) and update it incrementally rather than re-summarizing from scratch.
- Tool output verbosity is the primary token budget killer. Filter and truncate tool responses at ingestion time, not at compression time — keep only what the agent needs to reason about next.
- Monitor for drift signals in multi-step tasks: re-reads of already-processed content, re-statements of decisions, goal wording changes. These are leading indicators of context degradation.
- For very long sessions (hours, hundreds of turns), consider ACON-style failure analysis to tune compression prompts for the specific task domain — a one-time investment that pays off across all future sessions of that type.
Architectural pattern for production agents:
Incoming message
↓
[Context budget check]
< 70% → append normally
> 70% → [identify evictable span]
↓
[summarize span]
↓
[merge into anchor state]
↓
[append anchor + recent messages]
↓
LLM call
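The flow above can be sketched end to end; `summarize` stands in for the LLM summarization-and-merge call, and the 70% trigger and keep-recent count mirror the guidance earlier in this document (all names are illustrative):

```typescript
interface Msg {
  role: string;
  content: string;
  tokens: number;
}

interface Anchor {
  summary: string;
}

// One pass of the pipeline: budget check → identify evictable span →
// summarize → merge into anchor → keep anchor + recent messages.
function compactIfNeeded(
  history: Msg[],
  anchor: Anchor,
  budget: number,
  keepRecent: number,
  summarize: (evicted: Msg[], prior: Anchor) => Anchor, // LLM call in practice
): { history: Msg[]; anchor: Anchor } {
  const used = history.reduce((n, m) => n + m.tokens, 0);
  if (used <= budget * 0.7) return { history, anchor }; // < 70% → append normally
  const evicted = history.slice(0, -keepRecent);        // identify evictable span
  const recent = history.slice(-keepRecent);            // recent turns stay verbatim
  return { history: recent, anchor: summarize(evicted, anchor) };
}
```

The LLM call then receives the anchor (rendered into the prompt) followed by the retained recent messages.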
Sources
- Evaluating Context Compression for AI Agents — Factory.ai
- Compressing Context — Factory.ai
- Evaluating context compression in AI agents — Tessl.io
- ACON: Optimizing Context Compression for Long-horizon LLM Agents — arXiv
- Context Window Management Strategies — Maxim AI
- Context Engineering for AI Agents: Token Economics — Maxim AI
- Compaction — Anthropic Claude API Docs
- Context Compaction API — DEV Community
- Cognitive Degradation Resilience Framework — Cloud Security Alliance
- How Context Drift Impacts Conversational Coherence — Maxim AI
- Agent Drift: Behavioral Degradation in Multi-Agent LLM Systems — arXiv
- The Fundamentals of Context Management and Compaction in LLMs — Medium
- Efficient Context Management for LLM-Powered Agents — JetBrains Research
- AI Agents Need Memory Control Over More Context — arXiv
- Why AI Agent Pilots Fail in Production — Composio

