Transactional Outbox Pattern: Reliable Side-Effect Delivery in Microservices

Executive Summary

The transactional outbox pattern addresses a recurring problem in distributed systems: atomically committing a business state change and publishing a downstream event when the database and the message broker are two separate systems. Without it, a crash between the two writes leaves the system inconsistent: either an event is silently lost, or a phantom event is published for a change that never committed. The pattern's solution is to write the event into an outbox table in the same database transaction as the business record, then let a separate relay process forward it to the broker asynchronously. This document covers core mechanics, relay strategies (polling vs. CDC), the complementary inbox pattern for consumer idempotency, production observability requirements, and the situations in which the pattern is overkill.

The Problem: Dual Write in Distributed Systems

Every service that both persists state and publishes events faces the dual-write problem. Two independent writes cannot be made atomic without a distributed coordination protocol:

Write business data to the database — then crash before publishing → event is lost, downstream systems are never notified.
Publish the event first, then write to the database — then crash before writing → a phantom event triggers downstream effects with no business record to back them.

Two-Phase Commit (2PC) would solve this at the protocol level but is rarely practical at scale: it blocks participants while the coordinator decides, reduces throughput, and depends on XA support that most message brokers lack. The outbox pattern trades 2PC for a simpler invariant: use the database itself as the source of truth for both business data and pending events.

Core Mechanics

The Outbox Table

The outbox table lives in the same database as the business tables. A canonical schema looks like:

CREATE TABLE outbox_tasks (
    -- gen_random_uuid() is built in since PostgreSQL 13; earlier versions need pgcrypto
    id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    event_type   TEXT NOT NULL,
    payload      JSONB NOT NULL,
    status       TEXT NOT NULL DEFAULT 'pending',   -- pending | processed | failed
    created_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
    processed_at TIMESTAMPTZ,                       -- set when the relay marks the row processed
    retry_count  INT NOT NULL DEFAULT 0
);

The Atomic Write

The business handler inserts into both the domain table and the outbox table in a single transaction:

BEGIN;
  INSERT INTO organizations (id, name, ...) VALUES (...);
  INSERT INTO outbox_tasks (event_type, payload)
    VALUES ('org.created', '{"org_id": "...", "name": "..."}');
COMMIT;

Either both records land, or neither does. The relay process is now the only actor responsible for forwarding the event — application code never talks to the broker directly.
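As a minimal sketch of the same atomic write from Go, assuming the standard database/sql package with any PostgreSQL driver: the organizations columns (id, name), the orgCreated payload type, and the function names are illustrative assumptions, not part of any library described here.

```go
package main

import (
	"context"
	"database/sql"
	"encoding/json"
)

// orgCreated is a hypothetical payload shape for the 'org.created' event.
type orgCreated struct {
	OrgID string `json:"org_id"`
	Name  string `json:"name"`
}

// orgCreatedPayload serializes the body stored in outbox_tasks.payload.
func orgCreatedPayload(orgID, name string) ([]byte, error) {
	return json.Marshal(orgCreated{OrgID: orgID, Name: name})
}

// createOrganization inserts the domain row and the outbox row in one
// transaction, so a crash at any point leaves either both rows or neither.
func createOrganization(ctx context.Context, db *sql.DB, orgID, name string) error {
	payload, err := orgCreatedPayload(orgID, name)
	if err != nil {
		return err
	}
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	// Rollback undoes both inserts on any early return; after a successful
	// Commit it does nothing and returns sql.ErrTxDone.
	defer tx.Rollback()

	if _, err := tx.ExecContext(ctx,
		`INSERT INTO organizations (id, name) VALUES ($1, $2)`,
		orgID, name); err != nil {
		return err
	}
	// The payload is passed as a string so the driver sends it as text,
	// which PostgreSQL converts to the JSONB column type.
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO outbox_tasks (event_type, payload) VALUES ($1, $2)`,
		"org.created", string(payload)); err != nil {
		return err
	}
	return tx.Commit()
}
```

The handler never touches the broker; the only thing it has to get right is that both inserts go through the same tx.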

The Relay (Outbox Processor)

A background process periodically reads pending rows, publishes them to the broker, then marks them processed:

SELECT id, event_type, payload
FROM outbox_tasks
WHERE status = 'pending'
ORDER BY created_at
LIMIT 100
FOR UPDATE SKIP LOCKED;   -- safe for concurrent relay workers

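After each successful publish, in the same transaction as the SELECT (the row locks taken by FOR UPDATE are released at commit or rollback):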

UPDATE outbox_tasks SET status = 'processed', processed_at = now()
WHERE id = $1;

Relay Strategies: Polling vs. Change Data Capture

There are two fundamentally different approaches to implementing the relay, with meaningful trade-offs between them.

Polling Publisher

The relay runs on a timer (typically every 100ms–2s) and queries the outbox table directly.

Advantages:

Simple to build and debug — pure application code, no infrastructure dependencies beyond the database.
Works with any relational database.
No additional operational components.

Disadvantages:

Adds continuous query load to the database (SELECT ... FOR UPDATE SKIP LOCKED).
At scale, each microservice running its own polling loop compounds database pressure — index bloat, lock contention, and connection exhaustion become real risks.
Introduces artificial latency of up to one polling interval (about half an interval on average).

Recommendation: Start here. The application's write path (the outbox insert inside the business transaction) is the same under CDC, so switching later should not require changes to the code that records events.

Change Data Capture (CDC) with Debezium

CDC tools such as Debezium read events directly from the database's Write-Ahead Log (WAL). PostgreSQL's logical replication exports committed changes as a stream; Debezium translates these into broker messages without ever querying the outbox table.

Advantages:

Low latency: events typically reach the broker within milliseconds of commit.
No additional load from SELECT queries on the application database.
Scales horizontally without compounding database pressure.

Disadvantages:

Typically requires Kafka and Kafka Connect as the target. Debezium Server can deliver to some other brokers, but it is one more component to operate.
WAL retention must be managed: Debezium holds a replication slot, and if it falls behind, the WAL grows without bound and can fill the disk.
More operational surface area — Debezium connector, Kafka Connect cluster, slot health monitoring.

The critical production risk with CDC is not CPU but WAL retention. PostgreSQL keeps WAL until the replication slot's LSN advances, so a stalled or slow connector makes it accumulate. Monitor slot lag, computed as pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) over pg_replication_slots, and alert aggressively.
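To make the slot lag number concrete, the sketch below shows a query a monitor could run and the arithmetic behind it. A PostgreSQL LSN is printed as two hexadecimal numbers separated by a slash (high 32 bits, then low 32 bits), and lag is the byte distance between two LSNs. The Go names (slotLagSQL, parseLSN, lagBytes) are illustrative assumptions; pg_wal_lsn_diff and pg_current_wal_lsn are the PostgreSQL functions available from version 10.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// slotLagSQL reports, per replication slot, how many bytes of WAL lie
// between the current write position and the position the consumer has
// confirmed.
const slotLagSQL = `
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots`

// parseLSN converts the textual LSN form "hi/lo" into a 64-bit byte position.
func parseLSN(s string) (uint64, error) {
	hi, lo, ok := strings.Cut(s, "/")
	if !ok {
		return 0, fmt.Errorf("invalid LSN %q: missing '/'", s)
	}
	h, err := strconv.ParseUint(hi, 16, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid LSN %q: %w", s, err)
	}
	l, err := strconv.ParseUint(lo, 16, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid LSN %q: %w", s, err)
	}
	return h<<32 | l, nil
}

// lagBytes computes current minus confirmed in bytes, the same subtraction
// pg_wal_lsn_diff performs on the server.
func lagBytes(current, confirmed string) (int64, error) {
	c, err := parseLSN(current)
	if err != nil {
		return 0, err
	}
	f, err := parseLSN(confirmed)
	if err != nil {
		return 0, err
	}
	return int64(c) - int64(f), nil
}
```

In practice the server-side query is usually enough; the client-side parsing is mainly useful when LSNs arrive as strings in logs or connector status output.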

Recommendation: Adopt CDC when polling latency or database load becomes a measurable problem, or when the team already operates Kafka infrastructure. The outbox table schema is the same either way.

Delivery Guarantees and Consumer Idempotency

The outbox pattern guarantees at-least-once delivery, not exactly-once. The classic failure scenario:

Relay publishes the event to the broker successfully.
Relay crashes before updating status = 'processed'.
On restart, the relay re-reads the pending row and publishes again.

The broker now has a duplicate. Downstream consumers must handle it. Two standard approaches:

Idempotency Key Check

Every event carries a stable, unique identifier (e.g., the outbox row's UUID). The consumer maintains a processed_events table:

CREATE TABLE processed_events (
    event_id UUID PRIMARY KEY,
    processed_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

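Before processing, in the same transaction as the business effect: INSERT INTO processed_events (event_id) VALUES ($1) ON CONFLICT DO NOTHING. If zero rows are inserted, the event is a duplicate and is skipped. Sharing the transaction matters: if the marker committed on its own and the consumer crashed before handling the event, the redelivered event would be treated as a duplicate and lost.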

The Inbox Pattern

The inbox is the consumer-side complement to the outbox. Instead of executing business logic directly on receipt, the consumer writes the incoming event to a local inbox table and acknowledges the broker only after that write commits. A separate background processor then dequeues each inbox row and handles it, marking the row processed in the same transaction as the local business operation so that each event takes effect once.

The inbox provides:

Deduplication: the inbox table's unique constraint on event_id rejects duplicates at the database level.
Consumer auditability: the inbox is a persistent record of all events received, in arrival order.
Decoupling from broker availability: events already stored in the inbox can be processed even if the broker is momentarily unavailable.

Outbox + Inbox together provide end-to-end reliability: the outbox guarantees the event leaves the producer; the inbox ensures it takes effect exactly once at the consumer, even when the broker delivers it more than once.
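The inbox flow can be modelled in a few lines of Go. This is an in-memory stand-in, not a database implementation: the slice plays the role of the inbox table in arrival order, the map plays the role of the unique constraint on event_id, and all type and method names are hypothetical.

```go
package main

// inboxRow is one received event; in a real consumer it is a table row.
type inboxRow struct {
	eventID   string
	payload   []byte
	processed bool
}

// Inbox is an in-memory model of an inbox table.
type Inbox struct {
	rows []*inboxRow          // arrival order
	byID map[string]*inboxRow // stands in for UNIQUE (event_id)
}

// NewInbox returns an empty inbox.
func NewInbox() *Inbox {
	return &Inbox{byID: make(map[string]*inboxRow)}
}

// Receive stores the event and returns false if event_id is already present.
// The broker message can be acknowledged after Receive returns in either
// case, because the event is durably recorded or was recorded earlier.
func (in *Inbox) Receive(eventID string, payload []byte) bool {
	if _, dup := in.byID[eventID]; dup {
		return false
	}
	row := &inboxRow{eventID: eventID, payload: payload}
	in.rows = append(in.rows, row)
	in.byID[eventID] = row
	return true
}

// ProcessPending calls handle for every unprocessed row in arrival order and
// marks a row processed only after handle succeeds. It returns the number of
// rows handled and stops at the first error, leaving later rows pending.
func (in *Inbox) ProcessPending(handle func(eventID string, payload []byte) error) (int, error) {
	handled := 0
	for _, row := range in.rows {
		if row.processed {
			continue
		}
		if err := handle(row.eventID, row.payload); err != nil {
			return handled, err
		}
		row.processed = true
		handled++
	}
	return handled, nil
}
```

With a real table, marking the row processed and applying the business change would share one transaction; the model only shows the ordering of the steps.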

Comparison with Related Patterns

Outbox vs. Saga

These patterns operate at different levels and are often combined:

Dimension	Transactional Outbox	Saga Pattern
Concern	Reliable event publishing	Coordinating a multi-step distributed transaction
Scope	Single service, single step	Multiple services, multiple steps
Failure recovery	Relay retries until published	Compensating transactions undo prior steps
Complexity	Low	Medium–High

The outbox is typically used inside a saga: each saga step publishes its transition event through the outbox, guaranteeing the next saga participant is notified reliably. Without the outbox, a saga step can commit locally but fail to notify the next participant.

Outbox vs. 2PC (Two-Phase Commit)

Dimension	Transactional Outbox	Two-Phase Commit
Atomicity	Per-database ACID + async relay	Distributed protocol across participants
Throughput	High (no blocking coordinator)	Low (all participants blocked during prepare phase)
Infrastructure	Database + relay process	Distributed transaction manager
Broker support	Not required	Most brokers do not support XA

The outbox wins on throughput, infrastructure, and broker compatibility, at the cost of accepting eventual consistency rather than synchronous consistency.

Go + PostgreSQL Implementation Patterns

The coco-outbox library (used in the Zylos/coco-workspace ecosystem) provides a reusable implementation of this pattern for Go services backed by PostgreSQL.

Key API surface

// Record an event atomically within a transaction
store.WithTx(tx).Record(ctx, outbox.Task{
    EventType: "org.created",
    Payload:   json.RawMessage(`{"org_id": "..."}`),
})

// The relay polls and dispatches
service.Start(ctx)  // begins background polling

The WithTx(tx) call is the critical integration point — it binds the outbox write to the caller's existing transaction, enforcing the atomic invariant at the API level rather than relying on caller discipline.

Integration Phase Pattern (cws-core case)

When integrating coco-outbox into an existing service that previously had its own ad-hoc init_task implementation:

Phase 1 (module replacement): Replace the bespoke init_task implementation with coco-outbox's Store/Service API. The event is still recorded in a standalone call (not within the business transaction), so the dual-write window remains open during this phase; that is an accepted interim state whose purpose is to establish the API boundary.
Phase 2 (transaction integration): Once handler-level transaction support lands (e.g., Issue #79 in cws-core), move the Record call inside the business transaction using store.WithTx(tx). No schema change is required — only the call site moves.

This two-phase approach decouples module replacement from full consistency guarantees, reducing the blast radius of each change.

Production Observability Requirements

The outbox relay is a silent background process — without monitoring, it can fail for hours with no visible symptom while events queue up.

Metrics to instrument

Metric	Alert Condition
Outbox lag — time between event creation and successful publish	Alert if P99 exceeds SLA (e.g., > 30s for non-realtime, > 500ms for realtime)
Pending row count — rows in `status = 'pending'`	Alert if count grows monotonically (relay is stuck or falling behind)
Retry count distribution	Alert if any row exceeds max retries
DLQ size — events routed to dead-letter after max retries	Alert on any non-zero growth
Relay process health — PM2/process supervisor status	Alert on crash or restart
WAL slot lag (CDC only)	Alert if slot lag exceeds configured threshold
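Two of these metrics can be derived from one query against the outbox table, and the "grows monotonically" condition reduces to a small check over recent samples. The sketch below is illustrative: pendingStatsSQL assumes the outbox_tasks schema shown earlier, and growsMonotonically with its minSamples parameter is a hypothetical helper, not part of any monitoring tool.

```go
package main

// pendingStatsSQL returns the pending row count and the age in seconds of
// the oldest pending row (0 when nothing is pending).
const pendingStatsSQL = `
SELECT count(*) AS pending,
       COALESCE(EXTRACT(EPOCH FROM now() - min(created_at)), 0) AS oldest_age_seconds
FROM outbox_tasks
WHERE status = 'pending'`

// growsMonotonically reports whether every sample is strictly greater than
// the one before it. samples are pending-row counts ordered oldest first;
// fewer than minSamples (or fewer than two) samples never trigger.
func growsMonotonically(samples []int, minSamples int) bool {
	if len(samples) < minSamples || len(samples) < 2 {
		return false
	}
	for i := 1; i < len(samples); i++ {
		if samples[i] <= samples[i-1] {
			return false
		}
	}
	return true
}
```

Requiring a minimum number of samples keeps a short burst of traffic from paging anyone; the window size is a tuning decision for each service.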

Dead Letter Queue

Events that fail to publish after N retries must not be silently dropped. Route them to a DLQ with the original event, error details, and retry history. This enables:

Manual replay after fixing the downstream issue.
SRE investigation of the failure root cause.
Audit trail completeness.

When NOT to Use the Outbox Pattern

The pattern adds real complexity: an extra table, a relay process, observability requirements, and consumer idempotency logic. It is not the right default for every situation.

Avoid the outbox pattern when:

The service is a monolith with a single database. The outbox solves an inter-service consistency problem. In a monolith, state changes made by direct function calls within one database transaction commit atomically, so no relay is needed.
The downstream action is synchronous and user-visible. If the user is waiting for the action and the UI reflects its result immediately, eventual consistency may create a confusing experience. Synchronous calls with retry may be simpler.
No message broker exists. Introducing a broker just to support the outbox relay is rarely justified for a single use case. Consider whether the use case requires pub/sub at all.
The system can tolerate missed events. If the downstream effect is best-effort (e.g., cache warming, analytics), a best-effort async call without the outbox overhead may be sufficient.
The failure window is observable and recoverable. If the service can detect and manually repair a missed event (e.g., a reconciliation job), the simpler dual-write may be acceptable.

The key question is: what is the business cost of a missed event? If the answer is "critical — it causes billing errors, missing notifications, or data inconsistency," the outbox pattern is justified. If the answer is "low — it delays a cache refresh," simpler alternatives are likely better.

Operational Checklist

For teams deploying the outbox pattern in production:

Outbox table indexed on (status, created_at) to support efficient polling
SELECT ... FOR UPDATE SKIP LOCKED used by relay to support concurrent workers
Relay process supervised (PM2, systemd, Kubernetes liveness probe)
Outbox lag metric instrumented and alerted
Pending row count alerted on monotonic growth
DLQ implemented for events exceeding max retries
Consumer idempotency implemented (inbox table or idempotency key check)
WAL slot lag monitored if using CDC/Debezium
Periodic cleanup job for processed rows older than retention window

Key Takeaways

The transactional outbox pattern eliminates the dual-write problem by making the database — not the broker — the single atomic unit of truth for both state and pending events.
Two relay strategies: polling (simple, start here) and CDC (low latency, lower DB load, higher operational complexity). The outbox schema is identical for both, so migration is non-breaking.
The pattern guarantees at-least-once delivery. Exactly-once processing requires consumer-side idempotency — either an inbox table or an idempotency key log.
Outbox + Inbox together provide end-to-end reliability across the producer-consumer boundary without distributed transactions.
Observability is non-optional in production: instrument outbox lag, pending row count, DLQ size, and relay process health.
Don't default to the outbox everywhere — evaluate the business cost of a missed event against the complexity the pattern adds.