Replayable Agent Runtimes: Event Logs, Determinism, and Trace-to-Eval Loops
Executive Summary
Replayable agent runtimes need more than ordinary logging. Production agents need an authoritative execution history that can recover interrupted work, plus an observability trace that helps humans inspect failures, convert incidents into evals, and improve the system. The durable-execution ecosystem shows the core pattern: keep orchestration deterministic, isolate side effects behind recorded operations, and treat current state as a projection of an append-only history.
The important distinction is that recovery replay, debug replay, forensic replay, and evaluation replay are not the same product feature. A runtime can deterministically recover from a crash without being able to reproduce an LLM's reasoning bit-for-bit; an eval system can rerun a trace-derived example without being safe to reissue real-world side effects. Agent infrastructure should make these modes explicit instead of using "replay" as a vague promise.
Why Replay Matters for Agents
Long-running agents fail in ways that normal request/response applications rarely do:
- They may spend minutes or hours across model calls, retrieval, browser sessions, tool calls, human approvals, and background waits.
- They may partially complete work before a crash, rate limit, context compaction, model timeout, or user interruption.
- They may perform side effects that cannot be safely repeated, such as sending a message, writing a file, creating a ticket, charging a card, or changing configuration.
- They may need to explain why a decision was made after the original model context is gone.
Traditional logs help after the fact, but they are usually not sufficient as a recovery substrate. A replayable agent runtime needs a canonical execution record that says what was decided, what was scheduled, what side effects were attempted, what results came back, which approvals were granted, which artifacts were produced, and which runtime/prompt/tool versions were active.
The Two-Log Model
A practical design separates two records that are often conflated:
- Execution log: the source of truth for recovery and audit. It records workflow state transitions, model-call requests and response hashes or payload references, tool invocations, idempotency keys, approvals, retries, side-effect receipts, artifact hashes, prompt versions, model versions, and projection checkpoints.
- Observability trace: the queryable diagnostic view. It records spans for model calls, tools, retrieval, agents, handoffs, token usage, latency, errors, metadata, annotations, and evaluation results.
The execution log must be complete enough to rebuild state and prevent duplicated side effects. The trace can be sampled, redacted, indexed, visualized, and exported. If the trace becomes the only record, recovery semantics become hostage to retention windows, vendor schemas, sampling choices, and privacy redaction. If the execution log tries to become the only trace UI, debugging becomes slow and expensive.
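The separation is easier to see as types. A minimal Python sketch, with illustrative field names rather than any standard schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(frozen=True)
class ExecutionEvent:
    """Append-only record: complete, ordered, never sampled or redacted."""
    event_id: str
    run_id: str
    sequence: int           # strict per-run ordering for replay
    type: str               # e.g. "tool.completed", "approval.granted"
    payload_ref: str        # content hash; the payload lives in a blob store
    idempotency_key: Optional[str] = None   # present for side effects

@dataclass
class TraceSpan:
    """Diagnostic view: may be sampled, redacted, enriched, or discarded."""
    span_id: str
    parent_id: Optional[str]
    name: str               # e.g. "llm.call", "retrieval.search"
    attributes: dict = field(default_factory=dict)  # tokens, latency, errors
    sampled: bool = True    # dropping a span must never affect recovery
```

The asymmetry is the point: dropping a trace span costs visibility; dropping an execution event breaks recovery.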
What Durable Execution Teaches
Durable execution systems provide a useful baseline. Temporal's technical guide describes durable execution as persisting each step so a process can die and resume elsewhere with state intact; retries, timeouts, task queues, and long waits become runtime concerns rather than hand-written plumbing. Microsoft's Durable Task documentation is more explicit about the mechanism: orchestrators use event sourcing and may replay multiple times, so orchestrator code must be deterministic.
That constraint maps directly to agents. The agent orchestration layer should be deterministic over recorded history. Nondeterministic work belongs behind a durable boundary:
- wall-clock time becomes a recorded timestamp or durable timer
- randomness becomes a recorded value or seeded generator
- model calls become activities with captured inputs, model identity, tool schema, and response reference
- external APIs become idempotent activities with receipts
- file writes become validated mutations with content hashes
- human approvals become explicit events with actor, scope, payload, and timestamp
AWS's Durable Execution SDK makes the same point: handler code outside durable operations should be a pure function of inputs and completed operation results. Microsoft warns against direct I/O, mutable static variables, environment reads, ordinary sleep, and arbitrary async work inside orchestrators because replay can otherwise diverge or duplicate effects.
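The mechanism fits in a few lines. A hedged sketch of the pattern these SDKs implement, with a hypothetical `record` helper rather than any vendor's API:

```python
import random
from datetime import datetime, timezone

class DurableContext:
    """Replays recorded results in order; executes and records on the first run."""

    def __init__(self, history: list[dict]):
        self._history = history   # completed operation results, in order
        self._cursor = 0

    def record(self, op_name: str, fn):
        if self._cursor < len(self._history):
            entry = self._history[self._cursor]
            self._cursor += 1
            if entry["op"] != op_name:
                raise RuntimeError("nondeterministic orchestration path")
            return entry["result"]            # replay: reuse the recorded result
        result = fn()                          # first run: perform the operation
        self._history.append({"op": op_name, "result": result})
        self._cursor += 1
        # A real runtime durably persists this append before proceeding.
        return result

def orchestrate(ctx: DurableContext) -> dict:
    # Wall-clock time and randomness become recorded values, so a replay
    # takes exactly the branch the original execution took.
    started_at = ctx.record("now", lambda: datetime.now(timezone.utc).isoformat())
    seed = ctx.record("seed", lambda: random.randint(0, 2**32 - 1))
    # A model call is wrapped the same way:
    # answer = ctx.record("llm.call", lambda: call_model(prompt))
    return {"started_at": started_at, "seed": seed}
```

Everything in `orchestrate` that is not wrapped in `record` must be a pure function of its inputs and previously recorded results, which is exactly the constraint Microsoft and AWS document.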
Replay Is Not One Thing
Agent platforms should name their replay modes precisely:
| Replay mode | Purpose | What should be deterministic | What can vary |
|---|---|---|---|
| Recovery replay | Resume after crash, timeout, or restart | Orchestration path and completed side-effect results | Future model/tool calls after the resume point |
| Debug replay | Reconstruct what happened for inspection | State projection, causal order, trace links | UI views, annotations, derived summaries |
| Forensic replay | Prove an execution record was not tampered with | Event sequence, hashes, approvals, artifacts | Human interpretation |
| Evaluation replay | Test a new prompt/model/tool policy on old cases | Test input, reference evidence, evaluator version | LLM outputs and scores unless mocked |
The common failure is to assume evaluation replay is equivalent to exact replay. It is not. Unless model outputs, retrieval results, tool responses, environment state, and relevant external data are recorded or mocked, rerunning an LLM agent produces a new trial, not the same execution.
Trace-to-Eval as the Improvement Loop
Observability tools are converging around a trace-to-eval workflow. LangSmith captures production traces and lets teams move examples into datasets for offline experiments and comparison across prompts, models, or configurations. Phoenix and W&B Weave similarly model LLM applications as nested calls or spans with inputs, outputs, metadata, exceptions, usage, and evaluation artifacts. OpenTelemetry's GenAI semantic conventions are emerging as the shared vocabulary for model spans, agent spans, events, metrics, provider-specific attributes, token usage, and MCP-related telemetry.
For agent teams, the loop should look like this (the promotion step is sketched in code after the list):
- A production run fails, stalls, loops, overspends, or produces a weak answer.
- The trace identifies where it failed: retrieval, tool schema, tool result, prompt policy, model call, approval boundary, or state transition.
- The execution log anchors the exact evidence and side-effect state.
- A sanitized case is promoted into an eval dataset with references and expected behavior.
- New prompts, models, routing rules, or tool policies are tested against that dataset.
- The fixed version ships with a regression case that prevents the same class of failure from silently returning.
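The promotion step can be sketched without committing to any vendor's dataset API (LangSmith, Phoenix, and Weave each provide their own). The record shapes and event types below are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalCase:
    """A sanitized, pinned test case derived from a failed production run."""
    source_run_id: str        # links back to the execution log
    input: dict               # redacted task input
    evidence_refs: list[str]  # content hashes of retrieval/tool evidence
    expected_behavior: str    # human-written pass criterion
    pinned_versions: dict     # prompt/model/tool-schema versions at failure time

def promote_failure(run_id: str, events: list[dict], redact) -> EvalCase:
    """Turn the execution record of a failed run into a regression case."""
    task = next(e for e in events if e["type"] == "run.started")
    evidence = [e["payload_ref"] for e in events
                if e["type"] in ("retrieval.completed", "tool.completed")]
    return EvalCase(
        source_run_id=run_id,
        input=redact(task["payload"]),       # strip secrets/PII before sharing
        evidence_refs=evidence,               # references, not raw payloads
        expected_behavior="TODO: reviewer fills in the pass criterion",
        pinned_versions=task["runtime"],      # e.g. prompt_version, model
    )
```

Because the case stores references and pinned versions rather than expecting identical outputs, reruns are honest evaluation replay: comparable, not bit-identical.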
The February article on OpenTelemetry observability covered tracing as visibility. The missing layer here is replayability: traces explain; execution logs recover; eval datasets improve. A production agent needs all three.
Implementation Pattern
A minimal replayable agent runtime can start with a small event vocabulary:
```json
{
  "event_id": "evt_...",
  "run_id": "run_...",
  "sequence": 42,
  "type": "tool.completed",
  "timestamp": "2026-04-26T12:11:00Z",
  "causation_id": "evt_...",
  "correlation_id": "task_...",
  "actor": "agent",
  "runtime": {
    "workflow_version": "2026-04-26",
    "prompt_version": "resume-eval-v3",
    "model": "claude-sonnet",
    "tool_schema_version": "github-v2"
  },
  "payload_ref": "sha256:...",
  "side_effect": {
    "idempotency_key": "tool:github:create-pr:...",
    "receipt_ref": "sha256:..."
  }
}
```
From there, build projections:
- current run state
- pending approvals
- completed side effects
- artifact inventory
- cost and token ledger
- failure timeline
- eval-case candidates
Projection bugs are recoverable because the event log remains the source of truth. If a projection is wrong, rebuild it from history. If an event is wrong, append a compensating event rather than mutating history.
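A projection is then a deterministic fold over history. A minimal sketch, with event types matching the vocabulary above and a hypothetical compensation event:

```python
def project_run_state(events: list[dict]) -> dict:
    """Rebuild current run state by folding the append-only history in order."""
    state = {"status": "pending", "completed_effects": {}, "pending_approvals": set()}
    for event in sorted(events, key=lambda e: e["sequence"]):
        t = event["type"]
        if t == "run.started":
            state["status"] = "running"
        elif t == "approval.requested":
            state["pending_approvals"].add(event["correlation_id"])
        elif t == "approval.granted":
            state["pending_approvals"].discard(event["correlation_id"])
        elif t == "tool.completed":
            # Receipts keyed by idempotency key prevent re-execution on recovery.
            effect = event.get("side_effect")
            if effect:
                state["completed_effects"][effect["idempotency_key"]] = effect["receipt_ref"]
        elif t == "run.compensated":
            # A compensating event adjusts the projection without mutating history.
            state["completed_effects"].pop(event.get("reverses"), None)
    return state
```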
Design Rules
- Separate intention from mutation. Let the model emit structured intent; let deterministic runtime code validate it, append events, and execute side effects.
- Record side-effect receipts. Every external mutation needs an idempotency key and a durable receipt so recovery does not repeat it blindly (see the sketch after this list).
- Version all moving parts. Prompt, model, tool schema, sandbox image, retrieval index, evaluator, and workflow code version all matter.
- Do not put secrets in traces by default. OpenTelemetry's GenAI spec explicitly warns that prompts, outputs, and system instructions can contain sensitive data; capture must be policy-controlled.
- Treat human approval as data. Approval should include the actor, exact action, input payload, scope, and expiry, not just a UI button state.
- Compact long histories deliberately. Use snapshots, projections, and continue-as-new patterns so replay cost does not grow without bound.
- Classify replay mode before building. Recovery, debug, forensic, and eval replay need different guarantees.
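The receipt rule in particular deserves code. A sketch under the same assumptions as the earlier snippets, with a hypothetical receipt-store interface:

```python
import hashlib
import json

def execute_side_effect(store, idempotency_key: str, action: dict, perform) -> str:
    """Execute an external mutation at most once, returning a durable receipt ref."""
    existing = store.get_receipt(idempotency_key)
    if existing is not None:
        return existing           # recovery path: the effect already happened
    result = perform(action)      # the only place the real mutation occurs
    receipt = json.dumps({"key": idempotency_key, "result": result}, sort_keys=True)
    receipt_ref = "sha256:" + hashlib.sha256(receipt.encode()).hexdigest()
    # Persist the receipt and the completion event atomically; if this write
    # fails, recovery must treat the effect as in-doubt, not as unexecuted.
    store.put_receipt(idempotency_key, receipt_ref, receipt)
    return receipt_ref
```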
Failure Modes
Duplicate side effects. The agent crashes after an API call succeeds but before local state records it. On retry, it sends the same email, creates a second issue, or writes a second file. The fix is idempotency plus durable receipts.
Nondeterministic branching. The replay path changes because orchestration code reads current time, random values, environment variables, mutable global state, external files, or live APIs. The fix is to move nondeterminism into recorded durable operations.
Trace mistaken for authority. A trace viewer shows useful spans, but sampling, retention, redaction, or vendor transforms mean it cannot rebuild exact state. The fix is a separate execution log.
LLM replay drift. A team reruns a historical case and expects the same output, but the model, retrieval index, context packing, or tool result has changed. The fix is to label it as evaluation replay and store enough references to make comparisons meaningful.
Version skew. A long-running workflow started under one code version resumes under another. Durable systems solve this with versioning, patch markers, or compatible workflow evolution. Agent runtimes need the same discipline for prompts and tool schemas.
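One cheap discipline, sketched here against the event schema above: pin a resumed run to the versions it recorded at start rather than whatever is currently deployed.

```python
CURRENT_PROMPT_VERSION = "resume-eval-v4"   # what brand-new runs will record

def prompt_version_for(run_events: list[dict]) -> str:
    """Resume under the prompt version the run started with, not today's default.

    Pinning to the recorded version keeps the resumed path deterministic
    across deploys; only runs that start after the deploy pick up the new prompt.
    """
    started = next((e for e in run_events if e["type"] == "run.started"), None)
    if started is None:
        return CURRENT_PROMPT_VERSION          # new run: record the current version
    return started["runtime"]["prompt_version"]  # resumed run: honor history
```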
Approval ambiguity. A human approved "send it" without the runtime recording exactly what "it" was. The fix is approval events tied to immutable payload hashes.
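A sketch of hash-bound approval, reusing the `payload_ref` convention from the event schema above; the verification helper is illustrative:

```python
import hashlib
import json

def payload_hash(payload: dict) -> str:
    """Content-address the exact action being approved."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode()).hexdigest()

def verify_approved(approval_event: dict, action_payload: dict) -> bool:
    """Execute only if the action matches what the human actually saw and approved."""
    # A production check would also enforce scope and expiry (see the design rules).
    return (
        approval_event["type"] == "approval.granted"
        and approval_event["payload_ref"] == payload_hash(action_payload)
    )
```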
Implications for Agent Platforms
Replayability changes the shape of an agent runtime. The runtime is no longer just a chat loop with tools; it becomes an event processor with deterministic orchestration, durable side-effect boundaries, and trace-linked evaluation.
This is especially relevant for autonomous coding agents. A coding agent touches files, shells, package managers, remote APIs, issue trackers, and PRs. Replaying the model's private reasoning is less important than replaying the public execution contract: what it intended to change, what command ran, what files changed, which tests passed, what artifacts were produced, and whether the final state matches the requested outcome.
The strongest design is not "make LLMs deterministic." It is "make the runtime deterministic around nondeterministic intelligence." Let the model remain probabilistic, but constrain the operational envelope: typed intents, validated actions, append-only events, idempotent tools, explicit approvals, and eval cases born from real failures.
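A closing sketch of the first design rule, separating intention from mutation; the intent type and allowlist are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass

ALLOWED_TOOLS = {"github.create_pr", "fs.write_file"}   # illustrative allowlist

@dataclass(frozen=True)
class Intent:
    """Structured action proposal emitted by the model, not yet executed."""
    tool: str
    args: dict

def validate(intent: Intent) -> Intent:
    """Deterministic runtime gate between model output and side effects."""
    if intent.tool not in ALLOWED_TOOLS:
        raise ValueError(f"tool not allowed: {intent.tool}")
    if intent.tool == "fs.write_file" and ".." in intent.args.get("path", ""):
        raise ValueError("path traversal rejected")
    return intent

# The validated intent is appended as an event, then executed through the
# idempotent side-effect helper sketched earlier; the model never mutates
# anything directly.
```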
Sources
- Martin Fowler, "Event Sourcing" (2005): https://www.martinfowler.com/eaaDev/EventSourcing.html
- Temporal, "Durable Execution" technical guide: https://assets.temporal.io/durable-execution.pdf
- Temporal, "Ensuring Deterministic Execution" replay notes: https://assets.temporal.io/w/ensuring-deterministic-execution.pdf
- Microsoft Learn, "Durable orchestrator code constraints" (updated 2026): https://learn.microsoft.com/en-us/azure/durable-task/common/durable-task-code-constraints
- AWS Durable Execution SDK, "Determinism during replay": https://docs.aws.amazon.com/durable-execution/patterns/best-practices/determinism/
- LangGraph docs, "Durable execution": https://docs.langchain.com/oss/python/langgraph/durable-execution
- LangGraph docs, "Persistence": https://docs.langchain.com/oss/python/langgraph/persistence
- LangSmith docs, "Observability concepts": https://docs.langchain.com/langsmith/observability-concepts
- LangSmith docs, "Evaluation concepts": https://docs.langchain.com/langsmith/evaluation-concepts
- OpenTelemetry, "Semantic conventions for generative AI systems": https://opentelemetry.io/docs/specs/semconv/gen-ai/
- OpenTelemetry blog, "AI Agent Observability - Evolving Standards and Best Practices": https://opentelemetry.io/blog/2025/ai-agent-observability/
- Arize Phoenix, "LLM Tracing and Observability": https://phoenix.arize.com/llm-tracing-and-observability-with-arize-phoenix/
- W&B Weave docs, "Tracing": https://weave-docs.wandb.ai/guides/tracking/tracing
- W&B Weave docs, "Evaluations": https://weave-docs.wandb.ai/guides/core-types/evaluations
- arXiv, "ESAA: Event Sourcing for Autonomous Agents in LLM-Based Software Engineering" (2026): https://arxiv.org/abs/2602.23193
- arXiv, "Causal-Temporal Event Graphs: A Formal Model for Recursive Agent Execution Traces" (2026): https://arxiv.org/abs/2604.17557

