Context Window Management and Session Lifecycle for Long-Running AI Agents
Executive Summary
Long-running AI agents face a structural challenge that short-lived chatbots never encounter: the context window is finite, but the work is not. A coding agent tasked with building an application, an always-on assistant serving a team across days, or a research agent gathering information over hours will inevitably exhaust its working memory. How production systems handle this constraint — gracefully, without losing state or confusing the agent — has become one of the defining engineering problems of the agentic era.
Research from Anthropic, JetBrains, and the SWE-agent team converges on a counterintuitive insight: raw context size matters less than context quality. Context rot — the measurable degradation in model performance as context grows — begins well before the token limit. Chroma's 2025 study of 18 frontier models found degradation at every increment of context growth, with a "lost-in-the-middle" effect causing 30%+ accuracy drops for information buried in the middle of long conversations. The practical implication is that intelligent systems must manage context proactively, not reactively.
The field has converged on a tiered approach: compaction (summarizing the conversation while preserving architectural decisions), structured external memory (progress files, checkpoints, databases), and sub-agent delegation (offloading bounded subtasks that return only compact summaries). These are not mutually exclusive; the most robust production systems combine all three. Anthropic's own work on Claude Code found that with capable enough models, multi-session handoff complexity could be dropped entirely — compaction alone was sufficient to sustain continuity.
Priority and interruption management represent an underappreciated dimension. Agents embedded in real-world pipelines receive a mix of user requests, system signals, and maintenance tasks through a single conversation stream. Patterns for non-interrupting control injection, cooldown strategies, and context-aware deduplication are now essential components of any production agent harness.
The Context Rot Problem
Before examining solutions, it is worth understanding why context management is harder than it appears.
Modern frontier models advertise context windows of 200K–1M tokens, which might suggest the problem is solved by default. In practice, performance degrades substantially before those limits are reached. Chroma's systematic research found that a model with a 200K token window can exhibit significant degradation at 50K tokens. Three mechanisms drive this:
Lost-in-the-middle effect: Transformers attend well to the start and end of context but poorly to the middle, causing accuracy to fall for information in the interior of a long conversation. The U-shaped attention pattern described by Liu et al. (2023) holds when context is less than 50% full; beyond that threshold, models increasingly favor recent tokens, then middle tokens, over earlier content.
Attention dilution: Transformer attention is quadratic over sequence length. At 100K tokens, the model is managing roughly 10 billion pairwise relationships. The signal-to-noise ratio for any given piece of information drops as the window fills.
Distractor interference: Semantically similar but irrelevant content actively misleads the model. A conversation that has touched many topics provides many false candidate answers for retrieval-style tasks.
Paulsen (2025) further showed that context degradation is not limited to "needle-in-a-haystack" tasks; it affects a wide variety of reasoning tasks, and kicks in at lower token counts for more complex tasks. The practical threshold for proactive management is well under 50% of nominal context capacity.
Context Management Strategies: A Taxonomy
Production systems have developed three primary responses to context limits, each with distinct tradeoffs.
Truncation
The simplest approach: drop the oldest messages when token counts approach a threshold. Claude Sonnet 4.5 uses this natively — when approaching the context limit, it automatically drops the oldest tool outputs and interactions, preserving the dialogue flow. Truncation is computationally cheap but lossy. It works well for conversational agents where earlier turns are genuinely less relevant, but fails for task-oriented agents where early decisions constrain later work.
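The oldest-first policy can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation; the word-count token estimate is a crude stand-in for a real tokenizer.

```python
# Minimal sketch of oldest-first truncation. Counting whitespace-split
# words is a crude stand-in for a real tokenizer.
def count_tokens(message: dict) -> int:
    return len(message["content"].split())

def truncate_oldest(messages: list[dict], budget: int) -> list[dict]:
    """Drop the oldest messages until the total fits the token budget,
    always keeping at least the most recent message."""
    kept = list(messages)
    while len(kept) > 1 and sum(count_tokens(m) for m in kept) > budget:
        kept.pop(0)  # drop the oldest message first
    return kept
```

The lossiness is visible in the signature: whatever constraint an early message carried is simply gone once it is popped.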
Compaction and Summarization
Compaction takes a conversation approaching the context limit, summarizes its contents, and reinitializes a new context window with the summary. Codex CLI formalized this with a dedicated compaction item type that preserves the model's understanding of the original conversation. Claude Code exposes /compact for user-initiated compaction and runs automatic compaction when the auto_compact_limit is exceeded.
The JetBrains research team's 2025 study compared two summarization strategies across 250-turn agent trajectories on SWE-bench Verified:
- Observation masking: Replace older environment observations (file contents, command output) with placeholders while preserving reasoning and actions intact — a rolling window approach used in SWE-agent
- LLM summarization: Use a separate model to compress historical interactions into a narrative summary — used in OpenHands, Cursor, and Warp
Both reduced costs by over 50% versus unmanaged contexts. Counterintuitively, observation masking often matched or exceeded LLM summarization in solve rate. With Qwen3-Coder 480B, masking achieved 2.6% higher solve rates while being 52% cheaper. The reason: LLM summarization inadvertently extended agent trajectories by 13–15% by obscuring natural stopping signals. The optimal approach combines observation masking as the primary mechanism with selective summarization for genuinely complex historical state.
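Observation masking is mechanically simple, which is part of its appeal. A minimal sketch, assuming a flat turn list with illustrative `role` labels and a configurable recency window:

```python
# Sketch of observation masking: tool observations outside the most
# recent window are replaced with a one-line placeholder, while the
# agent's reasoning and actions are kept verbatim. The role names and
# default window size are illustrative assumptions.
PLACEHOLDER = "[observation elided - re-run the tool if needed]"

def mask_observations(turns: list[dict], keep_last: int = 5) -> list[dict]:
    cutoff = len(turns) - keep_last
    masked = []
    for i, turn in enumerate(turns):
        if turn["role"] == "observation" and i < cutoff:
            masked.append({"role": "observation", "content": PLACEHOLDER})
        else:
            masked.append(turn)
    return masked
```

Because reasoning and action turns pass through untouched, the agent's natural stopping signals survive, which is exactly the property LLM summarization tends to erase.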
For anchored iterative summarization (converged on by Factory, Anthropic, and others), the structure centers on four fields: intent, changes made, decisions taken, and next steps. This schema scores highest on accuracy for preserving technical details like file paths and error messages across compression cycles.
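The four-field schema can be made concrete as a small typed structure that is rendered into the opening message of each fresh context. The field names follow the schema above; the rendering format is an assumption for illustration.

```python
from dataclasses import dataclass, field

# Sketch of the four-field anchored summary schema: intent, changes
# made, decisions taken, next steps. The markdown-ish rendering is an
# illustrative assumption, not a specific vendor's format.
@dataclass
class CompactionSummary:
    intent: str                      # what the task is trying to achieve
    changes_made: list[str] = field(default_factory=list)
    decisions_taken: list[str] = field(default_factory=list)
    next_steps: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the summary as the opening message of a fresh context."""
        def bullets(items): return "\n".join(f"- {i}" for i in items) or "- (none)"
        return (
            f"## Intent\n{self.intent}\n"
            f"## Changes made\n{bullets(self.changes_made)}\n"
            f"## Decisions taken\n{bullets(self.decisions_taken)}\n"
            f"## Next steps\n{bullets(self.next_steps)}"
        )
```

Keeping the fields structured (rather than free prose) is what lets concrete details like file paths survive repeated compression cycles.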
External Memory and Checkpointing
The most robust approach stores critical state outside the context window entirely, making it available on-demand rather than always loaded. This is how human teams manage long projects: not by remembering everything, but by maintaining organized records.
Anthropic's harness work for multi-session agents used a claude-progress.txt file as a persistent narrative bridge. Each session begins by reading this file rather than parsing the full git history. The pattern explicitly captures:
- What was recently worked on
- What was left in an incomplete state
- What the next session should prioritize
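The progress-file pattern amounts to a few lines of file I/O. A sketch under stated assumptions: the section headers are invented for illustration, and a missing file is treated as a cold-start signal rather than an error.

```python
from pathlib import Path

# Sketch of the progress-file pattern: a small narrative file written
# at session end and read at session start. The section headers are
# illustrative assumptions.
PROGRESS_FILE = Path("claude-progress.txt")

def write_progress(recent_work: str, incomplete: str, next_priority: str,
                   path: Path = PROGRESS_FILE) -> None:
    path.write_text(
        f"RECENTLY WORKED ON:\n{recent_work}\n\n"
        f"LEFT INCOMPLETE:\n{incomplete}\n\n"
        f"NEXT SESSION SHOULD:\n{next_priority}\n"
    )

def read_progress(path: Path = PROGRESS_FILE) -> str:
    """Return the handoff narrative, or a cold-start notice if absent."""
    if path.exists():
        return path.read_text()
    return "No progress file found - treat this as a cold start."
```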
LangGraph's checkpointing system formalizes this at the framework level: snapshots of graph state are saved at every execution step, organized into threads, enabling fault-tolerant execution and human-in-the-loop workflows. Long-term memory lives in a separate store (JSON documents organized by namespace and key), persisting across sessions independently of context.
A key architectural principle from the DEV Community's state handoff research: distinguish between persistence (storing data) and handoffs (communicating between sessions). A database stores facts; a handoff tells a story. Effective handoffs include five layers:
- State snapshot — typed, validated current values
- Narrative context — 3-5 sentences explaining why the state looks as it does
- Decision log — what was decided, deferred, and why
- Priority queue — what the next session should do first, second, and third
- Warnings and gotchas — institutional knowledge about rate limits and environmental issues
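The five layers translate naturally into a typed record. This is a sketch, assuming Python dataclasses as the carrier; the field names mirror the layers above, and the validation rule is an illustrative interpretation of "a handoff tells a story".

```python
from dataclasses import dataclass, field

# Sketch of the five-layer handoff. Types and the validate() rule are
# illustrative assumptions, not a prescribed schema.
@dataclass
class SessionHandoff:
    state_snapshot: dict                 # typed, validated current values
    narrative_context: str               # why the state looks as it does
    decision_log: list[str] = field(default_factory=list)
    priority_queue: list[str] = field(default_factory=list)  # ordered next steps
    warnings: list[str] = field(default_factory=list)        # gotchas, rate limits

    def validate(self) -> bool:
        """A handoff without a narrative or priorities is just storage."""
        return bool(self.narrative_context.strip()) and bool(self.priority_queue)
```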
Session Lifecycle Patterns
The Warm vs. Cold Start Spectrum
A cold start is a session that begins with no prior state: the agent must infer context from scratch, re-discover what was done, and reconstruct its working model of the task. A warm start is a session that begins with a structured handoff, loading compressed state and immediately knowing its priorities.
Cold starts are catastrophic for task-oriented agents. Anthropic's multi-session experiment found that agents encountering broken states left by previous sessions would "spend substantial time trying to get the basic app working again." Worse, a second failure mode emerged: a later agent instance would survey progress, see that work had been done, and prematurely declare the task complete.
The initializer-coding agent pattern Anthropic developed addresses this directly:
- An initializer agent (first session only) creates the environment: feature lists in JSON, a reproducible init.sh, git repository structure, and the claude-progress.txt narrative file
- A coding agent (all subsequent sessions) reads progress artifacts before starting, works on one feature at a time, commits changes with descriptive messages, and updates documentation before exiting
Each session follows a standardized startup sequence:
- Establish working directory with pwd
- Read progress log and recent git history
- Execute init.sh to start the development server
- Run end-to-end tests to detect undocumented bugs from the prior session
- Select the next incomplete feature from the prioritized list
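The startup sequence can be modeled as an ordered pipeline that aborts on the first failure, so a broken environment is surfaced before any feature work begins. A control-flow sketch only: real steps would shell out to pwd, git, and init.sh, while the lambdas here are stand-in stubs.

```python
from typing import Callable

# Sketch of the standardized startup sequence as an ordered pipeline of
# checks, stopping at the first failure. The lambda stubs stand in for
# real shell commands (pwd, git log, init.sh, test runner).
def run_startup(steps: list[tuple[str, Callable[[], bool]]]) -> list[str]:
    """Run steps in order; abort session setup on the first failure."""
    log = []
    for name, step in steps:
        if not step():
            log.append(f"FAILED: {name}")
            break
        log.append(f"ok: {name}")
    return log

STARTUP_STEPS = [
    ("establish working directory (pwd)", lambda: True),
    ("read progress log and recent git history", lambda: True),
    ("run init.sh to start the dev server", lambda: True),
    ("run end-to-end tests", lambda: True),
    ("select next incomplete feature", lambda: True),
]
```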
Git serves as a recovery mechanism: descriptive commits allow agents to revert bad changes and restore working states when problems arise mid-session.
Context Monitoring and Proactive Rotation
Rather than waiting for context exhaustion, production systems monitor usage in real-time and trigger rotation proactively. Two-stage monitoring has emerged as a best practice:
- Early warning (e.g., 64% usage): Trigger memory sync — write important state to external storage while the agent still has enough context to do so coherently
- Session switch threshold (e.g., 80% usage): Initiate graceful handoff to a fresh session
The gap between stages is critical: it gives the memory sync process time to complete before the session must end. Triggering both at the same threshold creates a race condition where the session may exhaust context mid-sync.
Cooldown periods prevent redundant signals when context usage fluctuates near a threshold. If context temporarily dips below 80% after a tool output is cleared, the system should not re-trigger session rotation. Deduplication logic (tracking in-flight sync/rotation operations) prevents multiple concurrent handoffs from the same monitoring event.
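The two-stage thresholds, cooldown, and in-flight deduplication compose into one small monitor. A sketch under stated assumptions: the 64%/80% defaults mirror the figures above, while the `sync` and `rotate` callbacks and the class itself are illustrative, to be supplied by the harness.

```python
import time

# Sketch of a two-stage context monitor with cooldown and an in-flight
# flag. Thresholds follow the 64% / 80% example above; sync() and
# rotate() are harness-supplied callbacks (assumptions).
class ContextMonitor:
    def __init__(self, sync, rotate, warn_at=0.64, switch_at=0.80,
                 cooldown_s=300.0, clock=time.monotonic):
        self.sync, self.rotate = sync, rotate
        self.warn_at, self.switch_at = warn_at, switch_at
        self.cooldown_s, self.clock = cooldown_s, clock
        self.last_fired = -float("inf")
        self.in_flight = False

    def check(self, usage: float) -> str:
        """usage is the fraction of the context window consumed (0.0-1.0)."""
        if self.in_flight or self.clock() - self.last_fired < self.cooldown_s:
            return "skipped"          # dedup / cooldown: do nothing
        if usage >= self.switch_at:
            self.in_flight = True
            self.last_fired = self.clock()
            self.rotate()             # graceful handoff to a fresh session
            return "rotated"
        if usage >= self.warn_at:
            self.last_fired = self.clock()
            self.sync()               # write state out while context remains
            return "synced"
        return "ok"
```

The injected `clock` makes the cooldown logic testable without real waits, and the `in_flight` flag is what prevents two concurrent handoffs from the same monitoring event.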
Production Implementations
Claude Code
Claude Code's 200K token context window is split across conversation history, file reads, tool outputs, CLAUDE.md instructions, and auto-loaded memory. The system monitors token usage in real-time and offers both automatic compaction (at the auto_compact_limit) and manual /compact commands. The /clear command starts a completely fresh context, preferred when switching between unrelated tasks.
Session naming (--name flag, introduced in v2.1.76) enables resuming named sessions across invocations. Auto memory persists learned facts across sessions independently of the conversation context. CLAUDE.md provides persistent instructions without consuming per-turn tokens.
Anthropic's internal research found they could drop multi-session handoff complexity entirely with Opus 4.6, relying on compaction alone — a signal that improving base model capability reduces the engineering burden of context management.
Codex CLI (OpenAI)
Codex CLI addresses context in two phases. Early versions required manual /compact invocation. The current implementation uses an encrypted compaction item type in the conversation, preserving latent model understanding across compression cycles. Automatic compaction triggers when the auto_compact_limit is exceeded. Codex also supports MCP server mode, allowing it to run as a long-lived process orchestrated by the Agents SDK rather than as an invocation-per-task CLI.
Windsurf (Cascade)
Windsurf's Cascade agent uses a "Flows" model for persistent session context, combining RAG-based automatic context retrieval (the Fast Context indexing engine) with integrated terminal access. Rather than requiring manual @mention of relevant files (as Cursor does), Windsurf indexes the entire codebase semantically and pulls relevant snippets automatically. This trades explicit control for reduced cognitive overhead, and works well for large enterprise codebases where knowing which files are relevant is itself a non-trivial problem.
Cursor
Cursor takes the opposite approach: explicit, manual context management via @mention syntax for files, folders, and documentation. The practical context window is 10K–50K tokens depending on the model. Background Agents run in isolated VMs on separate branches, enabling true parallelism and multi-agent task distribution — a different approach to the long-running problem that avoids context limits by keeping individual agent scopes small.
SWE-agent and Open SWE
SWE-agent pioneered the Agent-Computer Interface (ACI) pattern: specialized commands for file navigation, editing, and search designed to minimize context overhead. Older observations (beyond the last 5 turns) are collapsed to single-line summaries rather than retained verbatim. Open SWE (LangChain) extends this with asynchronous operation and pre-loaded context from issue trackers, Slack threads, and PRs before the coding loop begins — reducing in-session discovery overhead and the tool calls that would otherwise consume context.
LangGraph
LangGraph provides checkpointing as a first-class primitive: every node execution saves a state snapshot, enabling time-travel debugging, human-in-the-loop interruption, and fault recovery. Long-term memory lives in a separate store with namespace/key organization, persisting across sessions and shared across parallel threads. The Platform (formerly LangGraph Platform, renamed LangSmith Deployment in October 2025) adds infrastructure-level checkpointing optimized for production scale.
Priority and Interruption Management
Always-on agents embedded in communication pipelines receive a heterogeneous stream of inputs: user requests, scheduled maintenance triggers, heartbeat checks, and system health signals. Managing these through a single conversation introduces priority and interruption challenges.
Non-Interrupting Control Injection
A key pattern: system signals should not interrupt user-facing work but must still be processed. One implementation uses a prefix convention — messages beginning with "Meanwhile, " are treated as background control signals that can be acknowledged without disrupting active task execution. The agent processes the signal, acknowledges it, and continues the primary task without context disruption.
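The routing logic behind the prefix convention is a one-branch dispatcher. A sketch assuming plain list-backed queues; the prefix follows the pattern above, while the queue handling is illustrative.

```python
# Sketch of prefix-based control injection: messages beginning with
# "Meanwhile, " go to a background control queue instead of interrupting
# the active task. Queue handling is an illustrative assumption.
CONTROL_PREFIX = "Meanwhile, "

def route_message(message: str, task_queue: list, control_queue: list) -> str:
    if message.startswith(CONTROL_PREFIX):
        control_queue.append(message[len(CONTROL_PREFIX):])
        return "control"   # acknowledged without disrupting the main task
    task_queue.append(message)
    return "task"
```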
Priority Levels and Task Queues
The LogRocket task queue pattern assigns explicit priority levels:
- User-facing requests get high priority and drain first
- Background summarization, memory sync, and health checks get normal priority
- Maintenance and cleanup tasks get low priority
This prevents background work from degrading user experience, and allows a new user request to preempt a background summarization that was triggered moments before.
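A three-level queue with FIFO ordering within each level is enough to get this behavior. A minimal sketch on Python's heapq; the class and level names are assumptions mirroring the pattern above, not LogRocket's code.

```python
import heapq
import itertools

# Sketch of a three-level priority queue: user-facing work drains first,
# then normal background work, then maintenance. A monotonic counter
# breaks ties FIFO within a level.
HIGH, NORMAL, LOW = 0, 1, 2

class AgentTaskQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def push(self, task: str, priority: int = NORMAL) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), task))

    def pop(self) -> str:
        """Return the highest-priority task; FIFO within a level."""
        return heapq.heappop(self._heap)[2]
```

A user request pushed with `HIGH` after a `NORMAL` memory sync was enqueued will still pop first, which is exactly the preemption described above.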
Deduplication and Cooldown
Monitoring loops can fire multiple signals before the first is acted upon — for instance, a context usage check that fires every 30 seconds may detect high usage three times before the first sync completes. Deduplication logic (a flag tracking in-flight operations) ensures only one sync runs at a time. Cooldowns prevent re-triggering for a fixed window after the threshold was first crossed, allowing usage to stabilize before evaluating again.
Memory Persistence Strategies
What to Persist
Not all state is worth persisting. The most impactful external memory stores contain:
- Decisions and rationale: Why a particular approach was chosen, what alternatives were rejected — prevents re-litigating settled questions
- Current task and blockers: What is actively being worked on and what is blocking progress
- Environment configuration: How to reproduce the working state (the equivalent of init.sh)
- Priority queue: The ordered list of next steps, explicitly authored by the agent at session end
Information not worth persisting: raw tool outputs, intermediate search results, temporary file contents. These can be retrieved on-demand and their bulk degrades summarization quality.
Tiered Memory Architecture
Production systems converge on a hierarchical architecture:
In-context (working memory): The active conversation window. Fast, volatile. Used for current reasoning.
Session-scoped external memory: Files like claude-progress.txt, written during the session and read at session start. Persists across context resets within a project.
Long-term semantic memory: Vector stores, databases, or structured files containing facts, user preferences, and historical decisions. Queried on-demand rather than bulk-loaded.
Archive: Cold storage for historical sessions, rarely accessed but available for deep historical lookups.
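The lookup path through these tiers can be sketched as hottest-first search with promotion on archive hits. Tier contents are plain dicts here for illustration; a production system would back the long-term tier with a vector store or database, and the promotion rule is an assumption.

```python
# Sketch of tiered memory lookup: query fast tiers first, fall back to
# colder ones, and promote archive hits so repeat lookups stay cheap.
# Dict-backed tiers and the promotion rule are illustrative assumptions.
class TieredMemory:
    def __init__(self):
        self.working = {}    # in-context: fast, volatile
        self.session = {}    # session-scoped files (e.g. progress notes)
        self.long_term = {}  # semantic memory, queried on demand
        self.archive = {}    # cold storage, rarely accessed

    def recall(self, key: str):
        """Search tiers from hottest to coldest; promote archive hits."""
        for tier in (self.working, self.session, self.long_term):
            if key in tier:
                return tier[key]
        if key in self.archive:
            value = self.archive[key]
            self.long_term[key] = value   # promote for cheaper future access
            return value
        return None
```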
Cost Optimization
Smart memory systems can reduce token costs by 80–90% while improving response quality by 26% versus basic chat history management. Key levers:
- Semantic caching: Identical or near-identical prompts return cached responses, eliminating 20–40% of API calls for repetitive traffic
- Token compression: Techniques like LLMLingua-2 compress prompts up to 5x by removing redundant tokens while preserving semantics
- Selective loading: Load only relevant memory tiers for each session type, rather than pre-populating the full context with everything known
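Of these levers, caching is the simplest to sketch. Real semantic caches match on embedding similarity; the exact-match-after-normalization version below is the degenerate case, already enough to catch verbatim repeats, and everything in it is illustrative rather than any product's API.

```python
import hashlib

# Sketch of a prompt cache keyed on a normalized prompt. Production
# semantic caches match on embedding similarity; exact match after
# whitespace/case normalization is the simplest version.
class PromptCache:
    def __init__(self):
        self._store = {}
        self.hits = 0

    @staticmethod
    def _key(prompt: str) -> str:
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, prompt: str):
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        return None

    def put(self, prompt: str, response: str) -> None:
        self._store[self._key(prompt)] = response
```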
The most expensive pattern is the "sequential memory" anti-pattern: sending the entire conversation history to the model on every turn. Even modest summarization or sliding-window approaches dramatically improve both cost and quality.
Key Principles for System Builders
Drawing across the sources surveyed, several principles apply broadly:
Manage context proactively, not reactively. Context rot begins long before the limit. Set thresholds at 60–70% of nominal capacity for early warning, and initiate rotation before 80%.
Separate memory sync from session switching. These are different operations with different failure modes. Triggering both simultaneously creates race conditions. Stage them with a buffer window.
Invest in handoff quality, not just storage. A good handoff is a narrative, not a data dump. The next session needs to know what matters right now, not just what happened.
Use external files as narrative bridges. claude-progress.txt, JSON feature lists, structured decision logs — these cost almost nothing and dramatically reduce cold-start confusion.
Design for observation masking before summarization. JetBrains' research shows masking is cheaper and often more effective than LLM-based summarization. Apply summarization selectively for truly complex historical state.
Non-interrupting control channels prevent cascading disruption. System signals should be designed to be processed asynchronously without interrupting user-facing tasks.
Test handoffs with actual restarts, not simulated loads. Real handoff failures often surface only under actual restart conditions, not during development testing.
References
- Chroma Research, "Context Rot: How Increasing Input Tokens Impacts LLM Performance" — https://research.trychroma.com/context-rot
- Anthropic Engineering, "Effective harnesses for long-running agents" — https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents
- Anthropic Engineering, "Effective context engineering for AI agents" — https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
- JetBrains Research, "Cutting Through the Noise: Smarter Context Management for LLM-Powered Agents" — https://blog.jetbrains.com/research/2025/12/efficient-context-management/
- Aureus / DEV Community, "Building Reliable State Handoffs Between AI Agent Sessions" — https://dev.to/aureus_c_b3ba7f87cc34d74d49/building-reliable-state-handoffs-between-ai-agent-sessions-1bk3
- Mem0, "LLM Chat History Summarization Guide 2025" — https://mem0.ai/blog/llm-chat-history-summarization-guide-2025
- OpenAI, "Unrolling the Codex agent loop" — https://openai.com/index/unrolling-the-codex-agent-loop/
- Maxim AI, "Context Window Management: Strategies for Long-Context AI Agents and Chatbots" — https://www.getmaxim.ai/articles/context-window-management-strategies-for-long-context-ai-agents-and-chatbots/
- LangChain, "LangGraph Platform is now Generally Available" — https://blog.langchain.com/langgraph-platform-ga/
- LangChain, "Introducing Open SWE: An Open-Source Asynchronous Coding Agent" — https://blog.langchain.com/introducing-open-swe-an-open-source-asynchronous-coding-agent/
- SWE-agent, "SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering" (NeurIPS 2024) — https://arxiv.org/pdf/2405.15793
- Cuckoo AI Network, "Agent System Architectures of GitHub Copilot, Cursor, and Windsurf" — https://cuckoo.network/blog/2025/06/03/coding-agent
- Redis, "AI Agent Memory: Build Stateful AI Systems That Remember" — https://redis.io/blog/ai-agent-memory-stateful-systems/
- LogRocket Blog, "Why your AI agent needs a task queue (and how to build one)" — https://blog.logrocket.com/ai-agent-task-queues/
- Daily Dose of DS, "A Practical Deep Dive Into Memory Optimization for Agentic Systems" — https://www.dailydoseofds.com/ai-agents-crash-course-part-16-with-implementation/
- DeepWiki, "Context Window & Compaction — Claude Code Session and Conversation Management" — https://deepwiki.com/anthropics/claude-code/3.3-session-and-conversation-management

