Zylos
2026-03-04

AI Agent Workflow Checkpointing and Resumability

ai-agents, fault-tolerance, durable-execution, langgraph, temporal, workflow-orchestration, checkpointing, production

Executive Summary

Long-running AI agent workflows are fundamentally different from short request-response interactions. A task that orchestrates dozens of tool calls, spans hours of execution, or requires human review mid-run cannot tolerate starting from scratch on failure. Checkpointing — saving execution state at defined intervals or boundaries — is the mechanism that closes this gap. Combined with event-history replay and idempotent tool design, checkpointing transforms brittle agentic pipelines into fault-tolerant, resumable workflows that behave like dependable software even when infrastructure fails.

Key Points

  • Checkpointing is now standard in production agent frameworks. LangGraph, Temporal, and Dagster all ship first-class checkpoint primitives; MemorySaver is for development, SQLite/Postgres for production.
  • Durable execution (Temporal's model) replays event history to reconstruct in-memory state after a crash, so agents resume at the exact step of failure without re-running completed work.
  • Idempotency is a prerequisite for safe checkpointing. Any tool that writes external state (creates a ticket, sends an email, charges a card) must carry an idempotency key tied to the workflow state to prevent duplicate side effects on replay.
  • Human-in-the-loop scenarios are a natural extension of checkpointing: the workflow persists state, pauses indefinitely waiting for human input, then resumes exactly from the pause point.
  • Very long workflows use a "continue-as-new" pattern (Temporal) or graph re-initialization to avoid event-history bloat while retaining durability.

Deep Dive

The Core Problem: Ephemeral Agents in a Flaky World

A naive agent implementation is a single process: receive task, run LLM loop, call tools, return result. This works fine for tasks that complete in seconds. It breaks the moment execution extends to minutes or hours, because any of the following will interrupt it mid-flight:

  • API rate-limit or timeout from any tool
  • Worker process restart (deployment, OOM, crash)
  • Context window exhaustion requiring a fresh model call
  • Human approval step that takes hours to complete

Without checkpointing, the entire workflow restarts from step one. For an agent that has already called ten external tools, this means duplicate work, duplicate side effects, and wasted cost.

Checkpoint Granularity

The right checkpoint boundary depends on the cost and reversibility of each step:

Node-level checkpointing (LangGraph model): Every graph node (tool call, LLM decision, conditional branch) writes a checkpoint before and after execution. Finer nodes mean more frequent checkpoints and less repeated work on recovery. The trade-off is storage volume — a checkpoint per node in a 50-step workflow generates 50 persisted states.

Activity-level checkpointing (Temporal model): Each Temporal Activity (a unit of work with retry logic) is recorded in the Event History. The Workflow code itself is deterministic and replays against this history on recovery, skipping already-completed Activities by using their recorded results rather than re-executing them. This is more efficient than per-node storage because history is append-only and compacted.

Explicit commit points: A simpler approach used in custom pipelines — the developer manually inserts save calls at "safe" boundaries (after a group of reads but before any writes). Coarser granularity means potentially re-doing more work, but it is far easier to reason about than automatic per-step persistence.
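
The explicit-commit-point style can be sketched in plain Python. This is a minimal illustration, assuming a JSON file as the checkpoint store; the store location, step names, and pipeline steps are all hypothetical:

```python
import json
import os

CHECKPOINT_PATH = "pipeline_checkpoint.json"  # hypothetical store location

def load_checkpoint():
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as f:
            return json.load(f)
    return {"completed": [], "state": {}}

def save_checkpoint(cp):
    # Write atomically so a crash mid-write cannot corrupt the checkpoint.
    tmp = CHECKPOINT_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump(cp, f)
    os.replace(tmp, CHECKPOINT_PATH)

def run_step(cp, name, fn):
    # Skip steps already recorded as complete — this is the resume logic.
    if name in cp["completed"]:
        return cp["state"].get(name)
    result = fn(cp["state"])
    cp["state"][name] = result
    cp["completed"].append(name)
    save_checkpoint(cp)  # commit point: the "safe" boundary after the step
    return result

cp = load_checkpoint()
run_step(cp, "fetch", lambda s: {"rows": 3})                  # group of reads
run_step(cp, "summarize", lambda s: s["fetch"]["rows"] * 2)   # derived work
```

If the process dies between steps, the next run reloads the checkpoint and `run_step` skips everything in `completed`, which is exactly the coarse-grained trade-off described above.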

Storage Backends

Backend                  | Use Case                                 | Durability
-------------------------|------------------------------------------|-------------------------------
MemorySaver (in-process) | Development and testing only             | None — lost on restart
SqliteSaver              | Local workflows, low concurrency         | Durable on disk, single-node
PostgresSaver            | Production multi-tenant agents           | Durable, concurrent, scalable
Redis                    | High-throughput short-lived checkpoints  | Configurable persistence
S3 / object storage      | Large workflow artifacts, archival       | Highly durable, cheap at scale

LangGraph recommends PostgresSaver for production and uses it internally in LangSmith. Temporal manages its own storage via a Temporal Service cluster (backed by Cassandra or PostgreSQL).

Event-History Replay: The Temporal Approach

Temporal's durable execution model is worth understanding deeply because it represents the most rigorous approach to workflow resumability. The core insight: workflow code should be deterministic — given the same event history, it always produces the same commands. This makes replay safe.

When a worker crashes at step 5 of a 10-step workflow:

  1. A new worker picks up the workflow from Temporal's service.
  2. It replays the event history from the beginning.
  3. Every completed Activity call is skipped — instead of re-executing, the worker reads the recorded result from history.
  4. When replay reaches the last completed step, the worker resumes normal execution.
  5. From the worker's perspective, execution is seamless. From the system's perspective, no work was repeated.

This is why workflow code must avoid non-deterministic operations (random numbers, timestamps, direct network calls) — those must be wrapped as Activities whose results are recorded in history. The pattern is: anything that touches the outside world is an Activity; everything else is deterministic workflow logic.
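The replay mechanic can be shown with a toy simulator. This is not Temporal's API — just the skip-completed-activities idea in plain Python, with an in-memory dict standing in for the durable event history:

```python
# Toy event history: activity name -> recorded result.
history = {}

def activity(name, fn, *args):
    # On replay, a completed activity's recorded result is returned
    # instead of re-executing the side-effecting function.
    if name in history:
        return history[name]
    result = fn(*args)
    history[name] = result  # record the result "durably"
    return result

calls = []  # tracks real executions so we can see what replay skipped

def fetch_user(uid):
    calls.append("fetch_user")
    return {"id": uid, "name": "Ada"}

def send_email(user):
    calls.append("send_email")
    return f"sent to {user['name']}"

def workflow(uid):
    # Deterministic workflow logic: same history -> same commands.
    user = activity("fetch_user", fetch_user, uid)
    return activity("send_email", send_email, user)

workflow(42)             # first run executes both activities
replayed = workflow(42)  # "crash + replay": both results come from history
```

The second call re-runs the workflow function from the top, yet neither side-effecting function executes again — which is the property that makes determinism a hard requirement.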

The 2025 integration between OpenAI's Agents SDK and Temporal brought this model to agents built on OpenAI tooling — retries, state management, and implicit checkpoint-like recovery are handled automatically without agent authors needing to write checkpoint logic themselves.

LangGraph Checkpointing: State Graph Persistence

LangGraph takes a different angle suited to the graph-of-nodes model. A checkpointer is attached to the graph at compile time:

from langgraph.checkpoint.sqlite import SqliteSaver

memory = SqliteSaver.from_conn_string(":memory:")  # dev
# or for production:
# from langgraph.checkpoint.postgres import PostgresSaver
# memory = PostgresSaver.from_conn_string(DATABASE_URL)

graph = workflow.compile(checkpointer=memory)

Every super-step (execution of one or more nodes) writes a checkpoint keyed by thread_id. Resuming is a matter of passing the same thread_id back to the graph:

config = {"configurable": {"thread_id": "task-abc-123"}}
result = graph.invoke(inputs, config=config)

If the process died partway through, the next invocation with the same thread finds the last checkpoint and continues from there. This also enables time-travel debugging: because every state is persisted, developers can inspect the graph state at any historical step and re-run from any point — invaluable for debugging non-deterministic LLM outputs.

Human-in-the-Loop as a First-Class Checkpoint

One of the most compelling use cases for checkpointing is human approval. An agent may need to:

  • Present a draft action to a human before executing it
  • Wait for a human to supply a value (approval code, clarification, override)
  • Pause for asynchronous review by a separate team member

Without checkpointing, this requires a complex external queuing system. With checkpointing, the workflow simply emits an interrupt, persists state, and halts. The human reviews the checkpoint state through a UI (or API), provides input, and resumes the workflow. The agent continues exactly where it paused, with human input injected into its state.

This pattern enables a spectrum from fully autonomous (no interrupts) to fully supervised (interrupt before every tool call), with the checkpoint system handling the persistence regardless of where the agent sits on that spectrum.
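
The pause/resume cycle can be sketched as a toy loop: persist state, raise an interrupt, then continue once a human supplies input. This is illustrative only — frameworks like LangGraph expose this through their own interrupt primitives, and the store, thread ID, and draft action here are hypothetical:

```python
class AwaitingHuman(Exception):
    """Raised when the workflow has checkpointed and is waiting on a person."""

store = {}  # thread_id -> persisted state (a real system uses durable storage)

def run(thread_id, human_input=None):
    state = store.get(thread_id, {"step": "draft"})
    if state["step"] == "draft":
        state["draft"] = "Refund $40 to customer #7"  # agent proposes an action
        state["step"] = "awaiting_approval"
        store[thread_id] = state                      # checkpoint, then pause
        raise AwaitingHuman(state["draft"])
    if state["step"] == "awaiting_approval":
        if human_input != "approve":
            raise AwaitingHuman(state["draft"])       # still paused
        state["step"] = "done"
        store[thread_id] = state
        return f"executed: {state['draft']}"

try:
    run("task-1")                                 # pauses at the approval point
except AwaitingHuman:
    pass                                          # surface the draft to a review UI here
result = run("task-1", human_input="approve")     # resumes from the pause point
```

The workflow can stay in `awaiting_approval` indefinitely; resumption is just another invocation with the same thread ID plus the human's input.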

Idempotency: The Prerequisite for Safe Replay

Checkpointing alone is not sufficient. If a workflow is replayed and tool calls are re-executed without idempotency, the result is duplicate side effects: two tickets created, two emails sent, two charges processed.

The solution is to bind every external write to a deterministic idempotency key derived from the workflow state:

# Bad: retrying will create duplicate records
create_ticket(title="Deploy failed", description=error_msg)

# Good: idempotency key ensures exactly-once creation
create_ticket(
    title="Deploy failed",
    description=error_msg,
    idempotency_key=f"{workflow_id}:{step_name}"
)

Temporal automatically handles this for Activities: a completed Activity's result is cached in history, so the Activity code never runs twice for the same workflow execution. For custom pipelines without this guarantee, idempotency keys must be implemented at the tool level.
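
At the tool level, an idempotency guard can be as simple as a keyed result cache in the tool's backing store. A minimal sketch, assuming an in-memory store (`seen`, `tickets`, and this `create_ticket` are illustrative, not a real ticketing API):

```python
seen = {}     # idempotency_key -> ticket id; in production this lives in the DB
tickets = []  # stand-in for the ticketing system's records

def create_ticket(title, description, idempotency_key):
    # If this key has been processed, return the original result instead of
    # creating a duplicate — replays become safe no-ops.
    if idempotency_key in seen:
        return seen[idempotency_key]
    ticket_id = len(tickets) + 1
    tickets.append({"id": ticket_id, "title": title, "description": description})
    seen[idempotency_key] = ticket_id
    return ticket_id

key = "wf-123:create_ticket"  # workflow id + step name, as above
first = create_ticket("Deploy failed", "OOM in worker", idempotency_key=key)
replay = create_ticket("Deploy failed", "OOM in worker", idempotency_key=key)
```

Both calls return the same ticket ID and only one record exists, regardless of how many times the step replays.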

Handling Very Long Workflows

Event histories grow unbounded in long-running workflows. Temporal addresses this with Continue-As-New: when history grows too large, the workflow atomically completes the current run and starts a new run with the same workflow ID, carrying forward only the essential state. The workflow appears continuous to external observers but internally uses a fresh history log.
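
A toy version of the idea: when the per-run log hits a limit, start a fresh run that carries forward only the essential state. This is illustrative — Temporal performs the handoff atomically inside its service — and the limit and event stream here are made up:

```python
HISTORY_LIMIT = 5  # hypothetical cap on events per run

def run_segment(state, events):
    # One "run": process events until the history limit forces a handoff.
    history = []
    for event in events:
        history.append(event)
        state["processed"] += 1
        if len(history) >= HISTORY_LIMIT:
            # Continue-as-new: fresh history, same logical workflow;
            # only the essential state crosses the boundary.
            return state, events[len(history):], True
    return state, [], False

state = {"processed": 0}
pending = [f"event-{i}" for i in range(12)]
runs = 0
continued = True
while continued or pending:
    state, pending, continued = run_segment(state, pending)
    runs += 1
```

Twelve events under a five-event cap take three runs, but `state` flows through unbroken — the workflow looks continuous from outside.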

LangGraph addresses this differently: by treating workflow state as explicit Python types, developers can control what gets persisted and prune stale context. For truly long agent runs, this often means compressing or summarizing historical context rather than keeping every past node's full state.

Production Checklist

Before shipping a long-running agent to production:

  1. Choose the right checkpointer: PostgresSaver or equivalent — never MemorySaver.
  2. Define thread IDs carefully: They are the primary key for resumption; use stable, deterministic IDs tied to the business task (not random UUIDs per attempt).
  3. Wrap all external calls as idempotent operations: Tie idempotency keys to workflow + step identity.
  4. Test failure injection: Deliberately kill the worker mid-workflow and verify it resumes correctly from checkpoint.
  5. Monitor checkpoint lag: Slow checkpoint writes under load can become a bottleneck — use async checkpointing or batching where possible.
  6. Plan for history pruning: Set retention policies on checkpoint storage to avoid unbounded growth.
  7. Separate read and write operations: Read-only tool calls (search, lookup) are safe to replay freely; write operations need the idempotency treatment above.
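
Checklist item 4 can be exercised with a small harness: inject a crash mid-workflow, then verify a second run resumes past the completed steps. A sketch, assuming a file-backed checkpoint and made-up step names:

```python
import json
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "cp.json")  # hypothetical checkpoint file

def run_workflow(crash_after=None):
    done = []
    if os.path.exists(path):
        with open(path) as f:
            done = json.load(f)
    executed = []
    for step in ["plan", "fetch", "write"]:
        if step in done:
            continue                    # resumed past this step
        executed.append(step)
        done.append(step)
        with open(path, "w") as f:
            json.dump(done, f)          # checkpoint after each completed step
        if step == crash_after:
            raise RuntimeError("injected crash")
    return executed

try:
    run_workflow(crash_after="fetch")   # worker "dies" mid-run
except RuntimeError:
    pass
resumed = run_workflow()                # second worker picks up the thread
```

The resumed run should execute only the remaining step; if it re-runs completed steps, the checkpoint boundary or the skip logic is wrong.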
