Long-Horizon Agent Goal Persistence: Cross-Session Continuity and Multi-Day Task Architecture

Executive Summary

The dominant limitation of AI agents is not intelligence but continuity. A single context window can hold a few hundred thousand tokens; a multi-day engineering project accumulates far more state than that in decisions, commits, file states, and intermediate results. The gap between what a model can hold and what a long-horizon task demands defines one of the central engineering challenges of 2026.

This research examines how the industry is closing that gap: through structured goal state persistence, cross-session memory architectures, artifact-based context bridges, and new runtime primitives like OpenAI Codex's /goal command. The findings reveal a field rapidly maturing from demo-friendly single-session tasks toward genuinely autonomous, multi-day work — but one still wrestling with fundamental reliability challenges as task duration increases.

The Core Problem: Session Boundaries as a Reliability Cliff

Every AI agent session begins with a clean context window. Without deliberate design, this means:

Prior decisions are invisible to the new session
Partially completed work may be redone or contradicted
Goals stated in session N may drift or narrow in session N+1
Test results and known failure modes vanish between runs

Reported figures from production deployments quantify this: every agent studied shows a drop in success rate after 35 minutes of continuous operation, and doubling task duration quadruples the failure rate. On 100-step trajectories, reliability is markedly lower than on 10-step tasks because errors accumulate, goals drift, and context degrades.

The field does not yet have principled solutions to these problems — only mitigation patterns. Understanding those patterns is the current state of the art.

Architectural Patterns for Cross-Session Continuity

Pattern 1: Artifact-Based Memory (The "Ralph Pattern")

The simplest and often most reliable approach uses external artifacts as the persistent context bridge. Pioneered in practice and later systematized by Anthropic's engineering team, it relies on:

Git history as an append-only log of decisions and completed work
Progress files (e.g., claude-progress.txt) as human-readable session handoff notes
Feature manifests as structured JSON lists with binary pass/fail status per item
Session initialization rituals: each new session reads git log, progress files, and runs smoke tests before beginning any new work

The strength of this approach is its simplicity: git is already the standard for persisting code artifacts, and progress files are readable by both agents and humans. Its weakness is that it depends on the agent writing high-quality notes at session end. That in turn requires a reliable shutdown procedure, which cannot be assumed when a session crashes or is cut off.
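The session initialization ritual and the pass/fail manifest can be sketched in a few lines of Python. This is a minimal illustration, not Anthropic's harness: the manifest filename features.json, its {"name", "passes"} schema, and every function name here are assumptions made for the example, and the git log and smoke-test steps are left out so the sketch stays self-contained.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional


def load_manifest(path: Path) -> List[Dict]:
    """Read the feature manifest: a JSON list of {"name": str, "passes": bool}."""
    return json.loads(path.read_text(encoding="utf-8"))


def next_pending_feature(manifest: List[Dict]) -> Optional[Dict]:
    """Return the first feature that has not passed yet, or None if all pass."""
    return next((f for f in manifest if not f["passes"]), None)


def append_handoff_note(progress_file: Path, session_id: int, note: str) -> None:
    """Append a human-readable note; earlier notes are never rewritten."""
    with progress_file.open("a", encoding="utf-8") as fh:
        fh.write(f"[session {session_id}] {note}\n")


def start_session(workdir: Path) -> Dict:
    """Rebuild working context from artifacts on disk, not from model memory."""
    progress_file = workdir / "claude-progress.txt"
    notes = (
        progress_file.read_text(encoding="utf-8").splitlines()
        if progress_file.exists()
        else []
    )
    manifest = load_manifest(workdir / "features.json")  # hypothetical filename
    return {
        "last_note": notes[-1] if notes else None,
        "next_feature": next_pending_feature(manifest),
    }
```

A new session would call start_session(workdir) before doing any work and call append_handoff_note as part of shutdown. Because the note file is append-only, a session that dies without writing a note still leaves the previous handoff intact.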

Anthropic's published harness design uses a two-agent architecture:

An Initializer Agent runs once, creates the foundational scaffolding (init scripts, progress file structure, initial commit)
A Coding Agent runs in every subsequent session, consumes the scaffolding, and appends to it

This separation ensures session N+1 always has a well-structured starting point, regardless of how session N ended.

Pattern 2: Durable Goal Objects

OpenAI's Codex CLI /goal command, released in April 2026 (v0.128.0+), introduced a new runtime primitive: the persistent goal object. Unlike a session prompt (which disappears when the session ends), a goal is stored at the thread level and survives:

Connection loss
Token budget exhaustion mid-task
Deliberate pauses
Complete session termination

Technically, a goal is distinguished from a plan. A plan is a sequence of steps toward the goal; plans are disposable. The goal itself — the durable objective — survives plan failures and triggers replanning. When a plan fails partway through, the agent devises a new plan while the goal remains active.

The goal lifecycle inside Codex includes create, pause, resume, and clear, with goal state persisted at the thread level rather than in the session. This shifts the interaction model from "answer this prompt" to "pursue this outcome", a meaningful architectural step toward multi-day autonomous work.
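The goal/plan distinction and the four lifecycle operations can be illustrated with a small file-backed store. This is a sketch under stated assumptions, not Codex's implementation: the Goal and GoalStore classes, the one-JSON-file-per-thread layout, and the record_plan_failure helper are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import List, Optional


@dataclass
class Goal:
    objective: str                      # the durable outcome being pursued
    status: str = "active"              # "active" or "paused"
    failed_plans: List[List[str]] = field(default_factory=list)  # discarded plans


class GoalStore:
    """Hypothetical store keeping one goal per thread id on disk."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, thread_id: str) -> Path:
        return self.root / f"{thread_id}.json"

    def _save(self, thread_id: str, goal: Goal) -> None:
        self._path(thread_id).write_text(json.dumps(asdict(goal)))

    def create(self, thread_id: str, objective: str) -> Goal:
        goal = Goal(objective)
        self._save(thread_id, goal)
        return goal

    def load(self, thread_id: str) -> Optional[Goal]:
        path = self._path(thread_id)
        return Goal(**json.loads(path.read_text())) if path.exists() else None

    def _set_status(self, thread_id: str, status: str) -> Goal:
        goal = self.load(thread_id)
        if goal is None:
            raise KeyError(thread_id)
        goal.status = status
        self._save(thread_id, goal)
        return goal

    def pause(self, thread_id: str) -> Goal:
        return self._set_status(thread_id, "paused")

    def resume(self, thread_id: str) -> Goal:
        return self._set_status(thread_id, "active")

    def clear(self, thread_id: str) -> None:
        self._path(thread_id).unlink(missing_ok=True)

    def record_plan_failure(self, thread_id: str, plan: List[str]) -> Goal:
        """A failed plan is logged and dropped; the goal itself stays as it was."""
        goal = self.load(thread_id)
        if goal is None:
            raise KeyError(thread_id)
        goal.failed_plans.append(plan)
        self._save(thread_id, goal)
        return goal
```

The point of the design is that nothing about the goal lives in process memory: a fresh GoalStore pointed at the same directory sees the same objective, status, and history of failed plans, which is what lets replanning happen after a session has ended.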

Pattern 3: Structured Memory Graphs

Research published in early 2026 demonstrates that flat, similarity-based memory retrieval degrades sharply on long-horizon tasks. Two notable systems address this:

MAGMA (Multi-Graph Agentic Memory Architecture, arXiv:2601.03236, accepted at ACL 2026) represents each memory item across four orthogonal graphs:

Semantic graph: conceptual and meaning relationships
Temporal graph: chronological ordering and recency
Causal graph: cause-and-effect chains
Entity graph: relationships between actors and objects

Retrieval is formulated as policy-guided traversal across these graph views rather than a single similarity lookup. This enables query-adaptive context construction: a temporal question pulls from the temporal graph, while a causality question traverses the causal graph. Results on the LoCoMo and LongMemEval benchmarks show an 18.6–45.5% improvement over prior baselines, with a 95% reduction in token consumption compared to full-context approaches.
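The idea of query-adaptive traversal can be shown with a toy example. This is not MAGMA's algorithm: a hand-written keyword rule stands in for the traversal policy, the four graphs are tiny hard-coded adjacency lists, and the names GRAPHS, choose_view, and traverse are invented for the sketch.

```python
import re
from collections import deque
from typing import Dict, List

# Toy memory: the same items linked differently in each graph view.
GRAPHS: Dict[str, Dict[str, List[str]]] = {
    "temporal": {"deploy_v2": ["rollback_v2"], "rollback_v2": ["deploy_v3"]},
    "causal": {"deploy_v2": ["latency_spike"], "latency_spike": ["rollback_v2"]},
    "entity": {"deploy_v2": ["platform_team"]},
    "semantic": {"deploy_v2": ["deploy_v3"]},
}


def choose_view(query: str) -> str:
    """Stand-in policy: pick a graph view from keywords in the query."""
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    if tokens & {"why", "cause", "caused", "because"}:
        return "causal"
    if tokens & {"when", "before", "after", "then"}:
        return "temporal"
    if tokens & {"who", "team", "owner"}:
        return "entity"
    return "semantic"


def traverse(view: str, start: str, max_hops: int = 2) -> List[str]:
    """Breadth-first walk of one graph view, limited to max_hops edges."""
    graph = GRAPHS[view]
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops >= max_hops:
            continue
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append((neighbor, hops + 1))
    return order


def retrieve(query: str, start: str) -> List[str]:
    """Build context by walking only the view the query calls for."""
    return traverse(choose_view(query), start)
```

Starting from the same memory item, retrieve("Why did we roll back?", "deploy_v2") follows causal edges through the latency spike, while retrieve("What happened after the deploy?", "deploy_v2") follows chronological edges to the next release. A single similarity lookup would return the same neighbors for both questions.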

Membox addresses long-horizon dialogue specifically with a hierarchical memory architecture. A "Topic Loom" groups consecutive same-topic dialogue turns into coherent memory boxes, then links them via long-range event-timeline traces. This structure prevents the common failure mode where an agent has high-quality memory of individual exchanges but cannot reconstruct the arc of a multi-day conversation.

Pattern 4: Externalizing State to Survive Process Restarts

Production agent systems require state that survives not just session boundaries but process restarts, crashes, and deploys. Key design requirements observed across production deployments:

Checkpointed execution: the ability to stop, resume, and retry at the subtask level (not the whole-task level)
Intermediate result persistence: partial work stored durably so crashes don't trigger full restarts
Cross-agent handoff state: when multiple agents collaborate, shared state must be accessible to any agent instance, not held in a single process's memory

This is the domain of durable execution frameworks (Temporal, Restate, custom solutions) applied to agent workloads. The core insight: agents should be designed assuming their process will die mid-task, and the task should resume at the last checkpoint rather than from scratch.
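Subtask-level checkpointing can be sketched without any framework. The code below does not use the Temporal or Restate APIs; it is a plain-Python illustration in which the function name run_with_checkpoints and the single JSON checkpoint file are assumptions. It also assumes each subtask returns a JSON-serializable result.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, List, Tuple


def run_with_checkpoints(
    subtasks: List[Tuple[str, Callable[[], Any]]],
    checkpoint_file: Path,
) -> Dict[str, Any]:
    """Run subtasks in order, skipping any whose result is already on disk."""
    if checkpoint_file.exists():
        done = json.loads(checkpoint_file.read_text())
    else:
        done = {}

    for name, fn in subtasks:
        if name in done:
            continue  # finished in an earlier run; do not redo it
        done[name] = fn()  # may raise; earlier results are already persisted
        # Write to a temp file, then rename, so a crash mid-write cannot
        # leave a truncated checkpoint behind.
        tmp = checkpoint_file.with_name(checkpoint_file.name + ".tmp")
        tmp.write_text(json.dumps(done))
        tmp.replace(checkpoint_file)
    return done
```

If the process dies during the third subtask, rerunning the same call re-executes only the third subtask onward. The first two results are read back from the checkpoint file, which is the "resume at the last checkpoint rather than from scratch" behavior described above.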

Goal Drift: Causes and Mitigations

Goal drift is the tendency of long-horizon agents to gradually narrow, distort, or abandon their original objective. Observed causes:

Context dilution: As the context window fills with task artifacts, the original goal prompt becomes relatively less prominent
Premature success patterns: Agents trained to be helpful complete subtasks and may interpret partial completion as task completion
Scope narrowing under uncertainty: Agents facing ambiguous situations tend to reduce scope to what they can confidently complete
Error accumulation: Each small deviation from the goal compounds across steps

Practical mitigations from production systems:

Explicit goal anchoring: Repeat the top-level goal at the start of each agent turn, not just the session
Pass/fail feature manifests: Structured checklists with explicit completion criteria prevent premature victory declarations
Mandatory verification gates: Require end-to-end tests (browser automation, integration tests) before marking features complete — agents frequently mark work done without verifying user-facing behavior
Session-end commit requirements: Enforce a git commit with a human-readable summary at every session boundary, making goal drift visible in the commit history
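Two of these mitigations, goal anchoring and verification gates, are simple enough to sketch directly. The function names and the dictionary-based manifest below are hypothetical; the verify callable stands in for whatever end-to-end check a real system would run, such as a browser automation suite.

```python
from typing import Callable, Dict


def anchored_prompt(goal: str, step_input: str) -> str:
    """Restate the top-level goal at the start of every turn, not only the first."""
    return f"TOP-LEVEL GOAL: {goal}\n\nCURRENT STEP: {step_input}"


def mark_complete(
    manifest: Dict[str, bool],
    feature: str,
    verify: Callable[[], bool],
) -> bool:
    """Set a feature to passing only if its end-to-end check succeeds."""
    if feature not in manifest:
        raise KeyError(f"unknown feature: {feature}")
    manifest[feature] = bool(verify())
    return manifest[feature]


def goal_achieved(manifest: Dict[str, bool]) -> bool:
    """The goal is done only when the manifest is non-empty and every item passes."""
    return bool(manifest) and all(manifest.values())
```

The gate removes the agent's ability to declare victory by assertion: the only path to a passing entry is through verify, and the goal counts as achieved only when every entry in the manifest passes.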

The Reliability Horizon: Where the Field Stands

Current data on AI task duration shows a consistent trend: the duration of tasks that agents can complete autonomously doubles every 7 months. In 2026:

2-hour autonomous tasks are routine for capable coding agents
8-hour workday-length tasks are the leading edge of production capability
Projections suggest 40-hour work-week-length tasks by 2028 and 167-hour work-month tasks by 2029

The engineering challenge these projections present is substantial. Under current architectures, doubling a task's duration quadruples its failure rate. Reaching 40-hour reliability will require not just better memory systems but fundamentally different designs for error recovery, goal persistence, and human-in-the-loop oversight.
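The arithmetic behind these figures can be worked through directly. The sketch below takes the 7-month doubling time and the quadrupling rule from the text at face value and assumes the 8-hour leading edge as the starting point; the function names are invented for the example. If each doubling of duration multiplies the failure rate by 4, then scaling duration by a factor r multiplies it by 4^(log2 r) = r^2.

```python
import math

DOUBLING_MONTHS = 7.0   # doubling time for autonomous task duration (from the text)
BASELINE_HOURS = 8.0    # assumed starting point: workday-length tasks


def months_until(target_hours: float, current_hours: float = BASELINE_HOURS) -> float:
    """Months needed to reach target_hours if duration keeps doubling on schedule."""
    return DOUBLING_MONTHS * math.log2(target_hours / current_hours)


def failure_multiplier(duration_ratio: float) -> float:
    """Quadrupling per doubling means failure rate scales with the ratio squared."""
    return duration_ratio ** 2


if __name__ == "__main__":
    print(round(months_until(40), 1))    # months from 8-hour to 40-hour tasks
    print(round(months_until(167), 1))   # months from 8-hour to 167-hour tasks
    print(failure_multiplier(40 / 8))    # failure-rate multiplier for a 5x longer task
```

Under these assumptions, 40-hour tasks arrive about 16 months after 8-hour tasks and 167-hour tasks about 31 months after, which is roughly in line with the 2028 and 2029 projections. The same rule implies a 25-fold increase in failure rate between an 8-hour and a 40-hour task. That multiplier can only hold while the base failure rate is small, since a failure rate cannot exceed 100%, which is one way of seeing why current architectures cannot simply be run for longer.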

The patterns described above — artifact-based memory, durable goal objects, structured memory graphs, checkpointed execution — are the current toolkit. None of them alone solves the reliability cliff; production systems combine several of them.

Implications for Zylos

For a long-running AI agent system like Zylos, these findings translate directly to design choices:

Session handoff artifacts: The current sessions/current.md memory pattern is the right instinct. Formalizing it with structured pass/fail task tracking and mandatory end-of-session commits would improve cross-session continuity.
Goal vs. task distinction: The Zylos scheduler tracks tasks (discrete, completable units). A goal layer above this — durable objectives that survive individual task failures — would enable multi-day work without loss of direction.
Memory graph structure: The existing flat memory files work well at the present scale. As memory grows, structured retrieval over temporal, semantic, and causal indexes should outperform linear search.
Failure rate awareness: Tasks expected to run beyond 35 minutes should have explicit checkpoint and retry design, not just execution.

Sources: