Zylos
2026-02-26

Webhook Reliability and Delivery Guarantees for AI Agent Platforms

webhooks · reliability · delivery-guarantees · ai-agents · distributed-systems · hmac · idempotency

Executive Summary

Webhooks are the connective tissue of modern event-driven architectures, and their reliability characteristics directly determine system-wide correctness. When an AI agent misses a payment confirmation, triggers a duplicate action, or processes events out of order, the consequences cascade unpredictably across business logic that was never designed for those edge cases. Getting webhooks right is not optional — it is a fundamental infrastructure concern.

The core insight that every practitioner eventually learns is that exactly-once delivery is a theoretical ideal, not an engineering reality. Even Stripe, GitHub, and AWS — which collectively process billions of webhook deliveries — operate on at-least-once semantics. The practical solution is to combine at-least-once delivery from the sender with idempotent receivers that can absorb duplicate deliveries without producing duplicate effects. This pairing, supported by idempotency keys and persistent deduplication stores, is the production-grade pattern.

For AI agent platforms specifically, the stakes are higher than for conventional SaaS. Agents act on webhook payloads autonomously: an agent that receives a duplicate task.completed event may mark a pipeline step done twice, skip a human review gate, or dispatch a second notification to a user. The infrastructure reliability layer must be treated as a first-class concern rather than delegated to whatever the hosting framework provides by default.

This article covers the full stack: delivery semantics, retry strategies with exponential backoff and circuit breakers, HMAC signature verification and replay attack prevention, dead letter queues, ordering challenges, rate limiting, infrastructure patterns from Stripe and GitHub, testing approaches, and the emerging standards (CloudEvents and Standard Webhooks) that are beginning to regularize the space. A dedicated section addresses AI agent-specific considerations that go beyond typical SaaS webhook guidance.


Delivery Semantics: At-Least-Once vs. Exactly-Once

The Fundamental Impossibility

Exactly-once delivery is the holy grail of distributed messaging — and it is essentially unachievable across network boundaries without distributed transaction protocols (like two-phase commit) that introduce unacceptable latency and complexity. The fundamental problem: the sender must know whether the receiver processed the message, but the acknowledgment itself can be lost. If the sender retries after a network timeout, the receiver may process the event twice. If the sender does not retry, the receiver may never process it at all.

This is not a solvable problem at the transport layer. It is a constraint rooted in the nature of asynchronous communication (the classic Two Generals problem: neither side can ever be certain that the other received its last message). The correct response is to accept it and build systems that tolerate it.

At-Least-Once: The Production Standard

At-least-once delivery guarantees that every event will be delivered — eventually — even if it means delivering it multiple times. This is the model used by:

  • Stripe: Webhooks may be delivered more than once; endpoints should be idempotent
  • GitHub: Guarantees eventual delivery but documents that duplicates are possible
  • Slack: Explicitly states that event delivery may be attempted multiple times
  • AWS SNS/SQS: At-least-once is the default; exactly-once requires FIFO queues with additional configuration

At-least-once is achievable because the sender only needs to retry until it receives a 2xx acknowledgment. It does not need to coordinate a distributed transaction or verify that the receiver's side effects completed.

At-Most-Once: The Simple but Lossy Alternative

At-most-once delivery fires and forgets — the sender makes one attempt and moves on regardless of outcome. This is appropriate only for genuinely loss-tolerant use cases like analytics events or heartbeat pings. For anything that drives business logic, at-most-once is the wrong choice.

Achieving Practical Exactly-Once: Idempotent Consumers

The industry standard pattern combines at-least-once delivery with idempotent processing on the receiver side:

// Idempotent webhook handler (TypeScript/Node.js)
import { createClient } from 'redis';
import { db } from './database';

const redis = createClient();

async function handleWebhookEvent(event: WebhookEvent): Promise<void> {
  const idempotencyKey = `webhook:processed:${event.id}`;

  // Attempt to acquire the lock atomically
  // NX = only set if not exists, EX = expiry in seconds
  const acquired = await redis.set(idempotencyKey, '1', {
    NX: true,
    EX: 86400, // 24-hour window to detect duplicates
  });

  if (!acquired) {
    // Already processed — return 200 to prevent sender retrying
    console.log(`Duplicate event ${event.id} — skipping`);
    return;
  }

  // Process event exactly once
  try {
    await db.transaction(async (tx) => {
      await processBusinessLogic(tx, event);
      await tx.insert('processed_events', {
        event_id: event.id,
        processed_at: new Date(),
        payload_hash: hashPayload(event),
      });
    });
  } catch (error) {
    // Release the lock so the sender's retry can reprocess this event;
    // otherwise a transient processing failure would be recorded as "done"
    await redis.del(idempotencyKey);
    throw error;
  }
}

The Redis SET NX operation is atomic — only one instance of a horizontally scaled receiver will succeed in claiming the lock, preventing parallel duplicate processing. The 24-hour TTL ensures the deduplication store does not grow unbounded.

A simpler database-only approach using a unique constraint:

async function handleWebhookIdempotent(event: WebhookEvent): Promise<void> {
  try {
    // Unique constraint on event_id prevents double insertion
    await db.insert('processed_events').values({
      event_id: event.id,
      processed_at: new Date(),
    });
  } catch (error) {
    if (isUniqueConstraintViolation(error)) {
      // Already processed — acknowledge and return
      return;
    }
    throw error;
  }

  // Process only if the insert succeeded.
  // Caveat: if processBusinessLogic throws here, the event is already
  // recorded as processed and will not be retried; when that matters,
  // run the insert and the processing in a single transaction instead.
  await processBusinessLogic(event);
}

Idempotency Keys in Practice

Idempotency keys are the unique identifiers that make deduplication possible. Best practices:

  1. Use the sender's event ID, not a locally generated ID. The webhook payload id field (or event_id, depending on the platform) is stable across retries — the same event keeps its ID each time it is delivered.
  2. Use the idempotency key as the storage key. Store processed event IDs in a database or Redis. Check this store before processing.
  3. Include payload hashing for corruption detection. If the same event ID arrives with a different payload, it indicates tampering or a bug — reject and alert.
  4. Set appropriate TTLs. Stripe recommends checking idempotency for at least 24 hours; 48–72 hours is safer for systems with long retry windows.
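Points 2 and 3 can be sketched together. The names `hashPayload` and `isConflictingDuplicate` are illustrative, and in production you would hash the raw request body rather than a re-serialized object (JSON.stringify is key-order sensitive):

```typescript
import { createHash } from 'crypto';

// Illustrative sketch: hash the payload so a conflicting body arriving
// under an already-seen event ID can be detected and rejected.
// Note: hashing a re-serialized object is order-sensitive; prefer hashing
// the raw request bytes in a real receiver.
function hashPayload(payload: unknown): string {
  return createHash('sha256').update(JSON.stringify(payload)).digest('hex');
}

// storedHash comes from the processed_events row written on first delivery
function isConflictingDuplicate(storedHash: string, payload: unknown): boolean {
  return storedHash !== hashPayload(payload);
}
```

If `isConflictingDuplicate` returns true, the same event ID arrived with different contents: reject the request and alert rather than silently acknowledging.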

Retry Strategies

Exponential Backoff

The simplest retry strategy — fixed intervals — is dangerous in production. If all subscribers to a platform retry at T+5s after a failure, the recovering service faces a thundering herd at exactly that moment. Exponential backoff spreads retries across time:

delay(n) = base * (multiplier ^ n)

Typical parameters:

  • base: 1–5 seconds
  • multiplier: 2
  • Max delay: 1–24 hours
  • Max attempts: 3–10 (depending on SLA)
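Worked through with base = 5 seconds, multiplier = 2, and an illustrative cap of 1 hour, the formula produces:

```typescript
// delay(n) = base * multiplier^n, capped at a maximum delay
const baseSeconds = 5;
const maxSeconds = 3600;

const scheduleSeconds = Array.from({ length: 8 }, (_, n) =>
  Math.min(baseSeconds * 2 ** n, maxSeconds),
);
// 5, 10, 20, 40, 80, 160, 320, 640 seconds
```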

Real-world schedules used by major platforms:

  • Stripe: Immediately, then exponentially up to 3 days (total ~87 attempts)
  • Svix: 5s, 5m, 30m, 2h, 5h, 10h, 10h (7 attempts over ~27h)
  • GitHub: Hourly retries for up to 72 hours
  • Shopify: 19 attempts over 48 hours

Jitter: Breaking Synchronization

Even exponential backoff can cause synchronized spikes if many subscribers failed at the same moment (e.g., during a platform-wide outage recovery). Jitter adds randomness:

function computeBackoffDelay(
  attempt: number,
  baseDelayMs: number = 1000,
  maxDelayMs: number = 300_000, // 5 minutes
): number {
  // Exponential: base * 2^attempt
  const exponential = baseDelayMs * Math.pow(2, attempt);
  const capped = Math.min(exponential, maxDelayMs);

  // Full jitter: random value in [0, capped]
  // This is generally preferred over "equal jitter" for distributed systems
  return Math.random() * capped;
}

// Usage in a retry loop
async function deliverWithRetry(
  endpoint: string,
  payload: unknown,
  maxAttempts: number = 7,
): Promise<void> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const response = await fetch(endpoint, {
        method: 'POST',
        body: JSON.stringify(payload),
        headers: { 'Content-Type': 'application/json' },
        signal: AbortSignal.timeout(30_000), // 30s timeout
      });

      if (response.ok) return; // Success

      // Don't retry 4xx (except 429) — they won't succeed on retry
      if (response.status >= 400 && response.status < 500 && response.status !== 429) {
        throw new NonRetryableError(`Received ${response.status} — moving to DLQ`);
      }

      // 5xx or 429 — retry with backoff
    } catch (error) {
      if (error instanceof NonRetryableError) throw error;
      // Network errors, timeouts — retry
    }

    if (attempt < maxAttempts - 1) {
      const delay = computeBackoffDelay(attempt);
      await sleep(delay);
    }
  }

  throw new MaxAttemptsExceeded(`Failed after ${maxAttempts} attempts`);
}

Jitter alone can reduce synchronized retry spikes by over 80% in real-world systems.

Response Code Handling Matrix

Not all failures should be retried:

  • 2xx (success): Acknowledge, mark delivered
  • 400 Bad Request (malformed payload): Move to DLQ immediately — retrying won't help
  • 401 Unauthorized (invalid credentials): Move to DLQ, alert operator
  • 404 Not Found (endpoint removed): Disable endpoint, notify subscriber
  • 408 Request Timeout (slow endpoint): Retry with backoff
  • 429 Too Many Requests (rate limited): Retry with backoff, honor the Retry-After header
  • 5xx Server Error (transient failure): Retry with backoff
  • Network timeout (infrastructure issue): Retry with backoff
  • DNS resolution failure (configuration error): Retry a limited number of times, then alert
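As a sketch, the matrix condenses into a small decision function. The names are my own, and the 404 disable-endpoint case is folded into dead-lettering for brevity:

```typescript
type DeliveryAction = 'ack' | 'retry' | 'dead_letter';

// Condenses the matrix above: 2xx acknowledges, 429 and 5xx retry,
// other 4xx responses dead-letter because a retry cannot succeed.
function classifyResponseStatus(status: number): DeliveryAction {
  if (status >= 200 && status < 300) return 'ack';
  if (status === 429) return 'retry';
  if (status >= 400 && status < 500) return 'dead_letter';
  return 'retry'; // 5xx and anything unexpected is treated as transient
}
```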

Circuit Breakers

Where exponential backoff operates at the individual event level, circuit breakers operate at the endpoint level. They prevent sending any events to an endpoint that has been consistently failing — giving it space to recover without being hammered:

enum CircuitState {
  CLOSED = 'CLOSED',     // Normal operation
  OPEN = 'OPEN',         // Blocking requests
  HALF_OPEN = 'HALF_OPEN', // Testing recovery
}

class WebhookCircuitBreaker {
  private state: CircuitState = CircuitState.CLOSED;
  private failureCount: number = 0;
  private lastFailureTime?: Date;
  private successCount: number = 0;

  constructor(
    private readonly failureThreshold: number = 5,
    private readonly recoveryTimeMs: number = 60_000, // 1 minute
    private readonly halfOpenSuccessThreshold: number = 2,
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === CircuitState.OPEN) {
      if (this.shouldAttemptReset()) {
        this.state = CircuitState.HALF_OPEN;
        this.successCount = 0;
      } else {
        // Route to holding queue instead
        throw new CircuitOpenError('Circuit is OPEN — routing to holding queue');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess(): void {
    if (this.state === CircuitState.HALF_OPEN) {
      this.successCount++;
      if (this.successCount >= this.halfOpenSuccessThreshold) {
        this.state = CircuitState.CLOSED;
        this.failureCount = 0;
      }
    }
    this.failureCount = 0;
  }

  private onFailure(): void {
    this.failureCount++;
    this.lastFailureTime = new Date();
    if (this.failureCount >= this.failureThreshold) {
      this.state = CircuitState.OPEN;
    }
  }

  private shouldAttemptReset(): boolean {
    if (!this.lastFailureTime) return false;
    return Date.now() - this.lastFailureTime.getTime() > this.recoveryTimeMs;
  }
}

When a circuit opens, new events bypass the delivery queue and go to a holding queue. When the circuit closes (after the recovery window), the holding queue is drained at a controlled rate — preventing the thundering herd that would occur if all queued events were dispatched simultaneously.
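The controlled drain can be sketched as a paced loop; the rate, names, and generic item type here are illustrative:

```typescript
// Drain held events at a fixed rate once the circuit closes, instead of
// dispatching the whole backlog at once.
async function drainHoldingQueue<T>(
  items: T[],
  deliver: (item: T) => Promise<void>,
  ratePerSecond = 10,
): Promise<void> {
  const intervalMs = 1000 / ratePerSecond;
  for (const item of items) {
    await deliver(item);
    // Pace deliveries so the recovering endpoint is not re-overwhelmed
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```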


HMAC Signature Verification

Why Signatures Matter

Without signature verification, any actor who knows your webhook endpoint URL can send arbitrary payloads to it. For an AI agent that acts autonomously on webhook content, this is a critical security surface: a malicious actor could trigger agent actions, inject false data, or cause denial-of-service.

HMAC (Hash-based Message Authentication Code) solves this by using a shared secret known only to the sender and receiver. The sender signs the payload, and the receiver verifies the signature before processing.

The HMAC-SHA256 Standard

The overwhelming industry standard is HMAC-SHA256, used by Stripe, GitHub, Shopify, Slack, and the Standard Webhooks specification:

import { createHmac, timingSafeEqual } from 'crypto';

interface WebhookVerificationOptions {
  secret: string;
  toleranceSeconds?: number; // Default: 300 (5 minutes)
}

function verifyWebhookSignature(
  rawBody: Buffer,
  headers: Record<string, string>,
  options: WebhookVerificationOptions,
): boolean {
  const { secret, toleranceSeconds = 300 } = options;

  // Extract timestamp and signatures from header
  // Stripe format: "t=1492774577,v1=5257a869e7ecebeda32affa62cdca3fa51cad7e77a05bd539313b8d25746afe"
  const sigHeader = headers['stripe-signature'] ?? headers['webhook-signature'];
  if (!sigHeader) return false;

  const parts = Object.fromEntries(
    sigHeader.split(',').map(part => part.split('=') as [string, string])
  );

  const timestamp = parseInt(parts['t'], 10);
  const receivedSignatures = sigHeader
    .split(',')
    .filter(p => p.startsWith('v1='))
    .map(p => p.slice(3));

  // 1. Validate timestamp to prevent replay attacks
  const now = Math.floor(Date.now() / 1000);
  if (Math.abs(now - timestamp) > toleranceSeconds) {
    throw new WebhookError('Timestamp outside tolerance window — possible replay attack');
  }

  // 2. Compute expected signature
  // Signed payload = timestamp + "." + raw body
  const signedPayload = `${timestamp}.${rawBody.toString('utf8')}`;
  const expectedSignature = createHmac('sha256', secret)
    .update(signedPayload)
    .digest('hex');

  // 3. Constant-time comparison to prevent timing attacks
  const expectedBuffer = Buffer.from(expectedSignature, 'hex');

  return receivedSignatures.some(sig => {
    const sigBuffer = Buffer.from(sig, 'hex');
    if (sigBuffer.length !== expectedBuffer.length) return false;
    return timingSafeEqual(sigBuffer, expectedBuffer);
  });
}

Why Timing-Safe Comparison is Non-Negotiable

Standard string comparison (===) short-circuits on the first mismatched character. An attacker can measure response times to determine how many leading characters of their guessed signature were correct, eventually reconstructing the valid signature through statistical analysis. timingSafeEqual compares all bytes in constant time, eliminating this timing side-channel.
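A minimal helper along these lines, wrapping Node's timingSafeEqual:

```typescript
import { timingSafeEqual } from 'crypto';

// Constant-time comparison of two hex-encoded digests.
// timingSafeEqual throws on unequal lengths, so check length first;
// digest length is fixed and public, so the check leaks nothing useful.
function safeCompareHex(a: string, b: string): boolean {
  const bufA = Buffer.from(a, 'hex');
  const bufB = Buffer.from(b, 'hex');
  if (bufA.length !== bufB.length) return false;
  return timingSafeEqual(bufA, bufB);
}
```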

Replay Attack Prevention

HMAC alone prevents payload tampering but not replay attacks. An attacker who captures a valid request (including its valid signature) can replay it hours or days later. The defense is timestamp inclusion:

  1. The sender includes a timestamp in the signature header (t=1492774577)
  2. The sender includes the timestamp in the signed data (so the signature is over "timestamp.body" not just "body")
  3. The receiver rejects any request where |now - timestamp| > tolerance

// Middleware for Express — reject stale webhooks
import type { Request, Response, NextFunction } from 'express';

function webhookTimestampGuard(toleranceSeconds: number = 300) {
  return (req: Request, res: Response, next: NextFunction) => {
    const sigHeader = req.headers['stripe-signature'] as string;
    const timestampStr = sigHeader?.match(/t=(\d+)/)?.[1];

    if (!timestampStr) {
      return res.status(400).json({ error: 'Missing timestamp in signature header' });
    }

    const timestamp = parseInt(timestampStr, 10);
    const now = Math.floor(Date.now() / 1000);

    if (Math.abs(now - timestamp) > toleranceSeconds) {
      return res.status(400).json({ error: 'Request timestamp too old — possible replay attack' });
    }

    next();
  };
}

Standard tolerance window: 5 minutes (300 seconds). This accommodates clock skew between distributed systems while keeping the replay window small. Some high-security implementations use 3 minutes.

GitHub's Webhook Verification Pattern

GitHub uses a slightly different scheme — the signature is in X-Hub-Signature-256 as sha256=<hex>:

function verifyGitHubWebhook(
  payload: string,
  signature: string,
  secret: string,
): boolean {
  const expected = `sha256=${createHmac('sha256', secret).update(payload).digest('hex')}`;
  const expectedBuffer = Buffer.from(expected);
  const receivedBuffer = Buffer.from(signature);

  if (expectedBuffer.length !== receivedBuffer.length) return false;
  return timingSafeEqual(expectedBuffer, receivedBuffer);
}

Note: GitHub does not include a timestamp in the signature — their replay attack mitigation is delivery-ID-based deduplication rather than timestamp validation.
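That delivery-ID approach can be sketched as follows, with an in-memory Set standing in for the persistent deduplication store a real deployment needs:

```typescript
// Deduplicate by the X-GitHub-Delivery UUID. An in-memory Set is used
// here for illustration only; production systems need a shared store
// (e.g. the Redis or database patterns shown earlier).
const seenDeliveryIds = new Set<string>();

function isReplayedDelivery(deliveryId: string): boolean {
  if (seenDeliveryIds.has(deliveryId)) return true;
  seenDeliveryIds.add(deliveryId);
  return false;
}
```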

Key Rotation

Shared secrets should be rotatable without downtime. Best practice:

// Support multiple valid secrets during rotation window
async function verifyWithKeyRotation(
  rawBody: Buffer,
  headers: Record<string, string>,
  secrets: string[], // [currentSecret, oldSecret]
): Promise<boolean> {
  return secrets.some(secret =>
    verifyWebhookSignature(rawBody, headers, { secret })
  );
}

Maintain two valid secrets during rotation: the new key and the old key. After all senders have been updated, remove the old key.


Dead Letter Queues

Why Events End Up in the DLQ

Dead letter queues capture events that have exhausted all retry attempts. Common causes:

  • Persistent endpoint unavailability: Downstream service is down for longer than the retry window
  • Permanently malformed payloads: The receiver always returns 400 because the payload does not match the expected schema
  • Business logic errors: The receiver returns 500 because of a bug in event processing
  • Credential rotation failures: HMAC verification fails because the subscriber's secret is stale
  • Rate limit exhaustion: The receiver's rate limit is too low for the event volume

Events should also be sent directly to the DLQ (bypassing retries) when the failure is clearly non-transient — specifically on 4xx responses other than 429.

DLQ Architecture

Event Producer
      │
      ▼
  Delivery Queue ──(retry exhausted)──► Dead Letter Queue
      │                                       │
   Workers                              DLQ Inspector
      │                                       │
   Endpoint                         Alert → Ops Team
                                             │
                                    Manual Replay CLI
                                             │
                                     Delivery Queue (re-enqueue)

// NestJS-style DLQ handler
import { Injectable } from '@nestjs/common';

@Injectable()
class DeadLetterQueueService {
  constructor(
    private readonly db: DatabaseService,
    private readonly alertService: AlertService,
    private readonly deliveryQueue: DeliveryQueueService,
  ) {}

  async enqueue(event: FailedEvent): Promise<void> {
    await this.db.insert('dead_letter_queue', {
      event_id: event.id,
      endpoint_id: event.endpointId,
      payload: event.payload,
      failure_reason: event.lastError,
      attempt_count: event.attemptCount,
      last_attempted_at: event.lastAttemptedAt,
      enqueued_at: new Date(),
    });

    // Alert on first DLQ entry for an endpoint in the past hour
    await this.alertService.maybeSendDLQAlert(event.endpointId);
  }

  // Manual replay: re-enqueue with reset attempt counter
  async replay(dlqItemId: string): Promise<void> {
    const item = await this.db.findOne('dead_letter_queue', dlqItemId);
    if (!item) throw new NotFoundError(`DLQ item ${dlqItemId} not found`);

    await this.deliveryQueue.enqueue({
      ...item,
      attempt_count: 0,
      replayed_from_dlq: true,
      replayed_at: new Date(),
    });

    await this.db.update('dead_letter_queue', dlqItemId, {
      status: 'replayed',
      replayed_at: new Date(),
    });
  }

  // Batch replay — rate limited to avoid thundering herd
  async replayBatch(endpointId: string, batchSize: number = 100): Promise<number> {
    const items = await this.db.query(
      'SELECT * FROM dead_letter_queue WHERE endpoint_id = ? AND status = ? LIMIT ?',
      [endpointId, 'pending', batchSize]
    );

    // Stagger replays to avoid overwhelming the recovering endpoint
    for (const item of items) {
      await this.replay(item.id);
      await sleep(100); // 10 RPS max during replay
    }

    return items.length;
  }
}

Alerting Thresholds

Effective DLQ alerting avoids both alert fatigue (too sensitive) and missed incidents (too loose):

  • First DLQ entry for an endpoint: Slack notification (low urgency)
  • 5+ DLQ entries in 10 minutes: Page on-call (medium urgency)
  • DLQ depth > 1000 events: Open an incident (high urgency)
  • DLQ growing faster than draining: Open an incident (high urgency)
  • Same event ID in DLQ 3+ times: Investigate the payload (medium urgency)
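These thresholds can be encoded directly; the function and field names here are hypothetical:

```typescript
type AlertUrgency = 'low' | 'medium' | 'high';

// Maps the alerting thresholds above to an urgency level.
function dlqAlertUrgency(stats: {
  depth: number;             // total events currently in the DLQ
  enqueuedLast10Min: number; // new DLQ entries in the past 10 minutes
  enqueueRate: number;       // events entering the DLQ per minute
  drainRate: number;         // events leaving the DLQ per minute
}): AlertUrgency {
  if (stats.depth > 1000 || stats.enqueueRate > stats.drainRate) return 'high';
  if (stats.enqueuedLast10Min >= 5) return 'medium';
  return 'low';
}
```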

Observability Metrics

Key DLQ metrics to track:

  • dlq.depth — total events in queue (per endpoint)
  • dlq.enqueue_rate — events entering DLQ per minute
  • dlq.age_max — age of oldest DLQ event (indicates how long issues go unresolved)
  • dlq.replay_success_rate — fraction of replayed events that succeed
  • webhook.delivery_success_rate — overall delivery health (7/28-day views)
  • webhook.latency_p99 — time from event emission to delivery acknowledgment

Webhook vs. WebSocket Trade-offs

Protocol Comparison

  • Connection model: Webhooks are stateless, one connection per event; WebSockets hold a persistent bidirectional connection
  • Latency: Webhooks typically 50–500ms per event (connection overhead); WebSockets <10ms once the connection is open
  • Infrastructure: Webhooks need only an HTTP server; WebSockets require sticky sessions or a pub/sub layer
  • Scalability: Webhooks scale horizontally because they are stateless; WebSockets require connection state management
  • Reliability: Webhooks come with built-in retry infrastructure; WebSockets need application-level reconnect logic
  • Push direction: Webhooks are server-to-client only; WebSockets are bidirectional
  • Debugging: Webhooks leave standard HTTP logs; WebSockets require WebSocket-aware tooling
  • Firewalls/proxies: Webhooks traverse standard firewalls; WebSockets may be blocked, so client libraries often fall back to long-polling

When to Use Webhooks

Webhooks are the right choice when:

  • Events are infrequent and discrete: Order placed, payment confirmed, file uploaded
  • Receiver is a server, not a browser: B2B integrations, Zapier/n8n flows, CI/CD pipelines
  • Guaranteed delivery matters more than latency: Business-critical events that must not be lost
  • The sender and receiver are different organizations: No shared infrastructure, standard HTTP is the lingua franca
  • Receivers need to be independently scalable: Each subscriber processes at its own pace

When to Use WebSockets

WebSockets are the right choice when:

  • Sub-100ms latency is required: Live collaborative editing, real-time trading, multiplayer games
  • Bidirectional communication is needed: The client sends updates as well as receiving them
  • High message frequency: Hundreds of events per second per connection
  • Streaming AI responses: Token-by-token LLM output streaming, agent status updates
  • Live dashboards: Metrics, logs, activity feeds that update continuously

The Liveblocks Pattern: Building AI Agents on WebSockets

Liveblocks, which builds collaborative AI copilots, chose WebSockets over HTTP for their agent infrastructure. Their reasoning:

"Copilots need to do more than return text — they should call tools, render UI, and give users control when manual confirmation is required. WebSockets are especially valuable because every client stays in sync whenever a user acts."

This reflects a broader pattern: AI agents that need to maintain shared state with multiple clients (users, other agents, dashboards) benefit from the always-open connection that WebSockets provide.

Hybrid Architectures

Most production AI agent platforms use both:

External Systems                    AI Agent Platform
      │                                    │
      │  Webhooks (inbound events)         │
      ├──────────────────────────────────► │
      │                                    │
      │                           Internal event bus
      │                                    │
      │                              Agent Workers
      │                                    │
      │  WebSockets (real-time UI sync)    │
      │ ◄──────────────────────────────── │
      │                                    │
Browser Clients                     Web Dashboard

The pattern: receive external events via webhooks (reliable, authenticated, retry-enabled), deliver real-time updates to UI clients via WebSockets (low latency, bidirectional). Webhooks and WebSockets are complementary, not competing.

Server-Sent Events as a Middle Ground

SSE (Server-Sent Events) provides a one-directional push channel over plain HTTP without the complexity of the WebSocket upgrade negotiation. For AI agent platforms that only need to push events to browser clients (not receive client pushes), SSE is often sufficient:

  • No special proxy configuration required
  • Automatic reconnection built into the browser API
  • Works with standard HTTP caching and auth middleware
  • But: unidirectional only, no binary frame support
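The SSE wire format itself is minimal: each message is a data: line, optionally preceded by an id: line that powers the browser's automatic Last-Event-ID resumption, and terminated by a blank line. A sketch of a formatter:

```typescript
// Formats one server-sent event. Including an id lets the browser resume
// from the last seen event via the Last-Event-ID header on reconnect.
function formatSseEvent(data: unknown, id?: string): string {
  const idLine = id !== undefined ? `id: ${id}\n` : '';
  return `${idLine}data: ${JSON.stringify(data)}\n\n`;
}
```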

Event Ordering Guarantees

Why Ordering is Hard

Even if the sender emits events in strict order, delivery order is not guaranteed because:

  1. Parallel delivery workers: Event 1 and Event 2 are dispatched simultaneously by different workers; Event 2 may reach the endpoint first if Event 1 encounters a transient network issue
  2. Retry asymmetry: Event 1 fails and is retried at T+5s; Event 2 succeeds immediately. The receiver sees Event 2 before Event 1.
  3. CDN/proxy routing: Different HTTP connections may take different network paths with different latencies
  4. Multi-datacenter delivery: Events sent from geographically distributed infrastructure may arrive out of order

Svix, one of the leading webhook infrastructure providers, states directly: "You can't guarantee webhook ordering." This is not a limitation of Svix specifically — it is a fundamental property of distributed HTTP delivery.

Sequence Numbers

The best mitigation is including sequence numbers in event payloads:

interface WebhookEvent {
  id: string;               // Unique event ID (idempotency key)
  type: string;             // e.g. "order.updated"
  sequence: number;         // Monotonically increasing per entity
  entity_id: string;        // The entity this event is about
  entity_version: number;   // Entity state version after this event
  timestamp: string;        // ISO 8601
  data: unknown;            // Event payload
}

The receiver can then detect out-of-order delivery:

async function handleOrderEvent(event: WebhookEvent): Promise<void> {
  const current = await db.getEntityState(event.entity_id);

  // Reject events older than current state
  if (event.entity_version <= current.version) {
    console.log(`Stale event — current version ${current.version}, event version ${event.entity_version}`);
    return; // Acknowledge (don't retry), but don't apply
  }

  // Buffer if a gap exists
  if (event.entity_version > current.version + 1) {
    await eventBuffer.store(event);
    // Try to fetch the missing events via API polling
    await backfillMissingEvents(event.entity_id, current.version + 1, event.entity_version - 1);
    return;
  }

  // Apply in-order event
  await applyEvent(event);
  await drainBufferedEvents(event.entity_id, event.entity_version + 1);
}

Thin Payload + Pull Pattern

An alternative that sidesteps ordering entirely: send "thin" webhook payloads containing only the event type and entity ID, then have the receiver pull the full current state from an API:

// Thin payload approach
interface ThinWebhookPayload {
  event_type: 'order.updated';
  order_id: string;
  timestamp: string;
}

async function handleThinWebhook(payload: ThinWebhookPayload): Promise<void> {
  // Don't trust the payload for state — fetch authoritative current state
  const order = await apiClient.getOrder(payload.order_id);
  await updateLocalOrderState(order);
}

This sacrifices the efficiency of embedded payloads but guarantees the receiver always has the correct final state, regardless of delivery order. The trade-off: higher API call volume, but simpler correctness model.

Per-Entity Ordering Channels

For systems where ordering matters, partition the delivery queue by entity ID so that all events for a given entity are processed serially:

// Route events to a worker based on a stable hash of entity_id
// (FNV-1a here as a dependency-free stand-in for murmurhash)
function fnv1a(str: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash = Math.imul(hash ^ str.charCodeAt(i), 0x01000193) >>> 0;
  }
  return hash;
}

function getWorkerIndex(entityId: string, workerCount: number): number {
  return fnv1a(entityId) % workerCount;
}

// All events for entity "order-123" always go to the same worker,
// preserving intra-entity ordering even with multiple workers

This preserves ordering within an entity while allowing parallel processing across different entities.


Rate Limiting Outbound Webhooks

Why Rate Limiting Matters

Without rate limits, a platform that generates bursts of events (e.g., bulk imports, batch operations, end-of-day processing) will deliver those bursts directly to subscriber endpoints. A subscriber receiving 10,000 events per minute when their endpoint handles 100 per minute will cascade into failures, DLQ overflow, and potential outages for the subscriber.

Rate limiting is about protecting receivers, not senders.

Token Bucket Implementation

The token bucket algorithm is well-suited for webhook rate limiting because it smooths bursts while allowing short-term spikes:

class TokenBucketRateLimiter {
  private tokens: number;
  private readonly maxTokens: number;
  private readonly refillRatePerMs: number;
  private lastRefillTime: number;

  constructor(
    private readonly endpointId: string,
    maxRequestsPerMinute: number,
  ) {
    this.maxTokens = maxRequestsPerMinute;
    this.tokens = maxRequestsPerMinute;
    this.refillRatePerMs = maxRequestsPerMinute / 60_000;
    this.lastRefillTime = Date.now();
  }

  canDeliver(): boolean {
    this.refill();
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }

  timeUntilNextToken(): number {
    this.refill();
    if (this.tokens >= 1) return 0;
    return (1 - this.tokens) / this.refillRatePerMs;
  }

  private refill(): void {
    const now = Date.now();
    const elapsed = now - this.lastRefillTime;
    const newTokens = elapsed * this.refillRatePerMs;
    this.tokens = Math.min(this.maxTokens, this.tokens + newTokens);
    this.lastRefillTime = now;
  }
}

Per-Destination Configuration

Different subscriber endpoints have different capacity. A rate limiting system should support per-destination configuration:

interface EndpointRateLimitConfig {
  endpoint_id: string;
  max_requests_per_minute: number;
  burst_allowance: number;      // Allow short bursts above the rate
  behavior_on_limit: 'queue' | 'drop' | 'error';
}

// Hookdeck's approach: adjustable per destination
// "Each user can have multiple destinations with different rate limits"

Handling 429 Responses

When a receiver returns 429, the webhook infrastructure should:

  1. Immediately pause delivery to that endpoint
  2. Check the Retry-After header — if present, wait exactly that long
  3. If no Retry-After, apply exponential backoff
  4. Resume delivery at a reduced rate after the wait period

async function handleRateLimitResponse(
  response: Response,
  endpoint: WebhookEndpoint,
): Promise<number> {
  const retryAfter = response.headers.get('Retry-After');

  if (retryAfter) {
    // Retry-After may be an HTTP-date or a delay in seconds
    const isHttpDate = isNaN(Number(retryAfter));
    if (isHttpDate) {
      // Clamp in case the date is already in the past
      return Math.max(0, new Date(retryAfter).getTime() - Date.now());
    }
    return parseInt(retryAfter, 10) * 1000;
  }

  // No Retry-After — use endpoint's current backoff
  return computeBackoffDelay(endpoint.consecutiveRateLimitHits);
}

Webhook Infrastructure at Scale

Stripe's Architecture

Stripe processes billions of webhook deliveries monthly. Key architectural decisions:

  1. Async decoupling: Events are persisted immediately after creation; delivery is handled by a separate async pipeline. The API response does not wait for webhook delivery.

  2. Standardized envelope: Every Stripe webhook event has consistent fields (id, type, created, livemode, data.object), enabling generic processing without custom parsers per event type.

  3. Resilient delivery with DLQs: Stripe uses AWS SQS with DLQ integration for resilient event delivery. Events that fail all retries go to an SQS Dead Letter Queue with configurable retry policies.

  4. Idempotent event IDs: evt_* identifiers are stable across retries. Consumers can safely use them as idempotency keys.

  5. Webhook endpoint health monitoring: Stripe automatically disables endpoints that consistently fail (after 72 hours of failures) and sends email notifications.

GitHub's Architecture

GitHub's webhook system emphasizes simplicity and observability:

  1. 10-second acknowledgment window: GitHub expects endpoints to return 2xx within 10 seconds. Process asynchronously — immediately acknowledge receipt, then process in a background job.

  2. 30-day delivery history: GitHub exposes webhook delivery history via API, enabling subscribers to replay failed deliveries manually.

  3. Per-repository and organization-level hooks: Allows fine-grained event routing at the source rather than requiring receivers to filter.

  4. X-GitHub-Delivery header: A unique UUID per delivery attempt, usable as an idempotency key.

  5. Recent delivery logs: The GitHub UI shows recent deliveries with request/response details — invaluable for debugging.

Slack's Architecture

Slack's event API highlights challenges unique to high-volume platforms:

  1. 3-second acknowledgment requirement: Even stricter than GitHub. Slack's platform retries if no 2xx within 3 seconds.

  2. Challenge verification: Before enabling webhooks, Slack sends a challenge request that the receiver must echo back — a one-time verification that the endpoint is legitimate.

  3. X-Slack-Retry-Num and X-Slack-Retry-Reason headers: Slack includes these headers on retry deliveries, enabling receivers to distinguish a first delivery from retries without relying on event ID deduplication alone.

  4. Event deduplication via event_id: X-Slack-Retry-Num: 1 combined with the stable event_id gives receivers everything they need for idempotent processing.
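
Challenge verification (item 2) can be sketched as a pure handler that echoes the one-time challenge and acknowledges everything else. The request shape follows Slack's documented url_verification payload; `handleSlackEvent` and its return shape are illustrative, and wiring into an HTTP framework is left out.

```typescript
type SlackRequestBody = {
  type: string;
  challenge?: string;
  event?: unknown;
};

function handleSlackEvent(body: SlackRequestBody): {
  status: number;
  responseBody: string;
} {
  // One-time URL verification handshake: echo the challenge back so
  // Slack marks the endpoint as verified
  if (body.type === 'url_verification' && body.challenge) {
    return {
      status: 200,
      responseBody: JSON.stringify({ challenge: body.challenge }),
    };
  }
  // Normal delivery: acknowledge immediately; real processing is async
  return { status: 200, responseBody: '' };
}
```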

Svix: Infrastructure for Webhook Providers

Svix provides webhook infrastructure-as-a-service for teams building webhook systems. Their architecture insights:

  • Message queue + workers: Decouple ingestion from delivery; workers pull from a queue and attempt delivery, enabling horizontal scaling of delivery capacity independently of ingestion
  • Automatic endpoint management: Track endpoint health, auto-disable consistently failing endpoints, notify application owners
  • Consumer portal: Embeddable UI that shows subscribers their own delivery logs, allowing self-service debugging without contacting support

General Scalable Architecture Pattern

                    ┌───────────────────────────────────┐
                    │          Event Producer           │
                    │    (API, DB trigger, scheduler)   │
                    └─────────────────┬─────────────────┘
                                      │ Persist event
                                      ▼
                    ┌───────────────────────────────────┐
                    │          Events Database          │
                    │    (authoritative event store)    │
                    └─────────────────┬─────────────────┘
                                      │ Fan-out
                                      ▼
                    ┌───────────────────────────────────┐
                    │    Delivery Queue (per tenant)    │
                    │       (Redis / SQS / Kafka)       │
                    └─────────────────┬─────────────────┘
                    ┌─────────────────┼─────────────────┐
                    ▼                 ▼                 ▼
               Worker 1          Worker 2          Worker N
                    │                 │                 │
               Rate limiter     Rate limiter      Rate limiter
               Circuit breaker  Circuit breaker   Circuit breaker
                    │                 │                 │
               Endpoint A       Endpoint B        Endpoint C
                    │
               (on failure)
                    ▼
             Dead Letter Queue
                    │
             Alert + Manual Replay

Testing Webhook Integrations

The Testing Challenge

Webhooks present a unique testing challenge: the sender is a third-party system outside your control. You can't simply call a function in a unit test — you need an external system to push an HTTP request to your endpoint.

Local Tunnels with ngrok

ngrok creates a secure tunnel from a public HTTPS URL to your local development server:

# Install ngrok
npm install -g ngrok

# Expose local port 3000
ngrok http 3000

# Output:
# Forwarding  https://abc123.ngrok.io -> http://localhost:3000

With the public URL, you can configure real webhook providers (Stripe, GitHub, etc.) to deliver to your local environment. ngrok's web interface at http://localhost:4040 shows all incoming requests with full headers and bodies — invaluable for debugging.

ngrok CLI in automated test setup:

# Extract the first forwarding URL from ngrok's log output
ngrok http 3000 --log=stdout | \
  grep "url=" | \
  head -1 | \
  sed 's/.*url=//'

Alternatives to ngrok:

  • Cloudflare Tunnel: Free, no rate limits, works with Cloudflare's global network
  • LocalTunnel: Open-source, self-hostable
  • Smee.io: Lightweight, purpose-built for webhook proxying
  • Webhook Relay: Persistent URLs even when your tunnel restarts

Mocking Webhook Payloads

For unit and integration tests that don't require live external systems:

// Jest example: mock webhook delivery
import { createHmac } from 'crypto';
import { webhookHandler } from './webhook-handler';

describe('Webhook Handler', () => {
  const secret = 'test-webhook-secret';

  function createStripeWebhookRequest(payload: object): {
    body: Buffer;
    headers: Record<string, string>;
  } {
    const body = Buffer.from(JSON.stringify(payload));
    const timestamp = Math.floor(Date.now() / 1000);
    const signedPayload = `${timestamp}.${body.toString('utf8')}`;
    const signature = createHmac('sha256', secret)
      .update(signedPayload)
      .digest('hex');

    return {
      body,
      headers: {
        'stripe-signature': `t=${timestamp},v1=${signature}`,
        'content-type': 'application/json',
      },
    };
  }

  it('processes payment.succeeded idempotently', async () => {
    const payload = {
      id: 'evt_test_123',
      type: 'payment_intent.succeeded',
      data: { object: { id: 'pi_test_456', amount: 2000 } },
    };

    const req = createStripeWebhookRequest(payload);

    // First delivery
    const result1 = await webhookHandler(req.body, req.headers);
    expect(result1.status).toBe(200);

    // Duplicate delivery — should not double-process
    const result2 = await webhookHandler(req.body, req.headers);
    expect(result2.status).toBe(200);

    // Verify business effect occurred exactly once
    const payment = await db.findPayment('pi_test_456');
    expect(payment.processedCount).toBe(1);
  });

  it('rejects replayed requests outside tolerance window', async () => {
    const payload = { id: 'evt_replay_test', type: 'payment_intent.succeeded' };
    const staleTimestamp = Math.floor(Date.now() / 1000) - 600; // 10 minutes ago
    const body = Buffer.from(JSON.stringify(payload));
    const signature = createHmac('sha256', secret)
      .update(`${staleTimestamp}.${body}`)
      .digest('hex');

    const req = {
      body,
      headers: { 'stripe-signature': `t=${staleTimestamp},v1=${signature}` },
    };

    const result = await webhookHandler(req.body, req.headers);
    expect(result.status).toBe(400);
  });
});

End-to-End Testing with Wiremock

For integration tests in CI/CD pipelines without external dependencies:

# docker-compose.test.yml
services:
  wiremock:
    image: wiremock/wiremock:latest
    ports:
      - "8080:8080"
    volumes:
      - ./test/wiremock:/home/wiremock

// test/wiremock/mappings/stripe-webhook.json
{
  "request": {
    "method": "POST",
    "url": "/stripe/webhook"
  },
  "response": {
    "status": 200,
    "body": "{\"received\": true}"
  }
}

Testing with Stripe CLI

# Stripe CLI — forward real Stripe events to local endpoint
stripe listen --forward-to localhost:3000/webhooks/stripe

# Trigger specific event types for testing
stripe trigger payment_intent.succeeded
stripe trigger customer.subscription.deleted

This combines the realism of actual Stripe event payloads with the convenience of local development.

CI/CD Webhook Testing Checklist

  • Unit tests for signature verification (valid, invalid, replay attack)
  • Unit tests for idempotent processing (first delivery, duplicate delivery)
  • Integration tests with mocked payloads for all handled event types
  • Load tests for high-volume scenarios
  • Chaos tests: what happens when the webhook handler is slow (>3s)?
  • Contract tests: does the payload schema match the documented API?

Emerging Standards

Standard Webhooks

The Standard Webhooks initiative (standardwebhooks.com), led by Svix with backing from Zapier, Twilio, Lob, Mux, ngrok, Supabase, and Kong, aims to eliminate the fragmentation in webhook implementations. The specification defines:

Required Headers:

Header              Description
webhook-id          Unique message ID — same across retries; use as idempotency key
webhook-timestamp   Unix timestamp — integer seconds since epoch
webhook-signature   HMAC-SHA256 signature(s) — may include multiple for key rotation

Signature Format:

webhook-signature: v1,<base64-encoded-hmac-sha256>

Multiple signatures (for key rotation):

webhook-signature: v1,base64_sig1 v1,base64_sig2

Signed Payload Construction:

{webhook-id}.{webhook-timestamp}.{body}
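
A hand-rolled verifier for this scheme might look like the following sketch. It assumes the raw secret bytes are already available (Standard Webhooks secrets are typically serialized as base64 with a whsec_ prefix and must be decoded first); `verifyStandardWebhook` is an illustrative name.

```typescript
import { createHmac, timingSafeEqual } from 'crypto';

function verifyStandardWebhook(
  secret: Buffer,
  body: string,
  headers: Record<string, string>,
  toleranceSeconds = 300,
): boolean {
  const id = headers['webhook-id'];
  const timestamp = headers['webhook-timestamp'];
  const signatureHeader = headers['webhook-signature'];
  if (!id || !timestamp || !signatureHeader) return false;

  // Reject timestamps outside the tolerance window (replay protection)
  const skewSeconds = Math.abs(Date.now() / 1000 - Number(timestamp));
  if (!Number.isFinite(skewSeconds) || skewSeconds > toleranceSeconds) {
    return false;
  }

  // Recompute the HMAC over "{id}.{timestamp}.{body}"
  const expected = createHmac('sha256', secret)
    .update(`${id}.${timestamp}.${body}`)
    .digest();

  // The header may carry several space-separated "v1,<base64>" entries
  // (key rotation); accept if any of them matches
  return signatureHeader.split(' ').some((entry) => {
    const [version, sig] = entry.split(',');
    if (version !== 'v1' || !sig) return false;
    const candidate = Buffer.from(sig, 'base64');
    return candidate.length === expected.length
      && timingSafeEqual(candidate, expected);
  });
}
```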

Why Standard Webhooks Matters:

Instead of each SDK implementing platform-specific signature verification, a single StandardWebhooks library covers all compliant providers:

import { Webhook } from 'standardwebhooks';

const wh = new Webhook(secret);

// Works for any Standard Webhooks-compliant provider
const event = wh.verify(rawBody, headers);

CloudEvents

CloudEvents is a CNCF graduated project (January 2024) that standardizes the envelope format for event data:

{
  "specversion": "1.0",
  "type": "com.example.order.created",
  "source": "https://example.com/orders",
  "subject": "order-123",
  "id": "evt-abc123",
  "time": "2026-02-26T10:00:00Z",
  "datacontenttype": "application/json",
  "data": {
    "order_id": "order-123",
    "total": 9999
  }
}

CloudEvents also defines an HTTP webhook binding specification that includes:

  • WebHook-Request-Origin header for abuse protection (sender identifies itself)
  • WebHook-Request-Rate header for rate negotiation
  • A handshake validation mechanism to prevent SSRF-style abuse

CloudEvents is gaining adoption in cloud-native infrastructure (Azure Event Grid, Google Eventarc, Knative Eventing, AWS EventBridge all support it), making it particularly relevant for AI agents deployed in cloud environments.

The Convergence

Standard Webhooks focuses on security and delivery mechanics (signatures, idempotency). CloudEvents focuses on event semantics (envelope format, source/type taxonomy). They are complementary: a CloudEvents payload can be delivered via a Standard Webhooks-compliant transport.
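
As a sketch of that pairing, the following builds the CloudEvents envelope from the example above and signs it with Standard Webhooks-style headers. `buildDelivery` and the msg_ message-ID scheme are illustrative assumptions, not part of either specification.

```typescript
import { createHmac } from 'crypto';

function buildDelivery(secret: Buffer, orderData: object) {
  // CloudEvents structured-mode envelope (fields from the example above)
  const body = JSON.stringify({
    specversion: '1.0',
    type: 'com.example.order.created',
    source: 'https://example.com/orders',
    id: 'evt-abc123',
    time: new Date().toISOString(),
    datacontenttype: 'application/json',
    data: orderData,
  });

  const messageId = 'msg_evt-abc123';
  const timestamp = String(Math.floor(Date.now() / 1000));
  // Standard Webhooks signed payload: "{id}.{timestamp}.{body}"
  const signature = createHmac('sha256', secret)
    .update(`${messageId}.${timestamp}.${body}`)
    .digest('base64');

  return {
    body,
    headers: {
      // Content type for structured-mode CloudEvents over HTTP
      'content-type': 'application/cloudevents+json',
      'webhook-id': messageId,
      'webhook-timestamp': timestamp,
      'webhook-signature': `v1,${signature}`,
    },
  };
}
```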


AI Agent-Specific Considerations

Autonomous Action Risk

Traditional webhook consumers are humans or deterministic code — they either process events correctly or fail visibly. AI agents operate differently: they may take autonomous actions (sending messages, modifying data, calling external APIs) based on webhook content. The consequences of duplicate delivery are therefore more severe:

  • A duplicate task.assigned event may cause an agent to send two notifications to a user
  • A duplicate payment.confirmed event may trigger two fulfillment workflows
  • A duplicate message.received event may generate two AI responses in a chat thread

Mitigation: Implement idempotency at every layer. The webhook receiver should be idempotent, but the downstream agent actions should also be idempotent or guarded by unique action IDs.
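
One way to guard a downstream action is to derive a deterministic action ID and record it atomically before acting. The in-memory map below stands in for a durable store with set-if-absent semantics (e.g. Redis SET NX); `notifyUserOnce` and all other names are illustrative.

```typescript
// In production this would be a durable KV store, not process memory
const performedActions = new Map<string, number>();

function setIfAbsent(key: string): boolean {
  if (performedActions.has(key)) return false;
  performedActions.set(key, Date.now());
  return true;
}

async function notifyUserOnce(
  eventId: string,
  userId: string,
  send: (userId: string) => Promise<void>,
): Promise<boolean> {
  // Derive a deterministic action ID from the event and the action type,
  // so a duplicate delivery maps to the same key and is skipped
  const actionId = `notify:${eventId}:${userId}`;
  if (!setIfAbsent(actionId)) return false; // duplicate — already sent
  await send(userId);
  return true;
}
```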

Long-Running Agent Tasks

AI agents often handle long-running tasks (minutes to hours). Webhook delivery systems expect acknowledgment within seconds (typically 3–30 seconds). The solution: immediate acknowledgment + async processing:

// Acknowledge immediately, process asynchronously
app.post('/webhooks/agent-task', async (req, res) => {
  // 1. Verify signature immediately
  if (!verifySignature(req)) {
    return res.status(401).json({ error: 'Invalid signature' });
  }

  // 2. Persist event to durable queue
  const jobId = await taskQueue.enqueue({
    eventId: req.body.id,
    payload: req.body,
    receivedAt: new Date(),
  });

  // 3. Acknowledge within seconds — don't wait for agent processing
  res.status(202).json({
    received: true,
    job_id: jobId,
    // Optionally provide a status endpoint for polling
    status_url: `/jobs/${jobId}/status`,
  });

  // Processing happens asynchronously in background workers
});

Agent-to-Agent Webhooks

As multi-agent systems become common, agents receive webhooks from other agents — not just from external platforms. Unique considerations:

  1. Recursive call loops: Agent A receives an event and calls Agent B; Agent B's response triggers an event that calls Agent A again. Implement loop detection via call-chain headers:

interface AgentWebhookHeaders {
  'x-agent-call-chain': string;   // Comma-separated agent IDs
  'x-agent-call-depth': string;   // Current depth as a string
  'x-agent-trace-id': string;     // Root trace ID
}

function detectCallLoop(headers: AgentWebhookHeaders, agentId: string): boolean {
  const chain = headers['x-agent-call-chain']?.split(',') ?? [];
  return chain.includes(agentId);
}

  2. Trust levels: Agent-to-agent webhooks require authentication — use per-agent API keys or short-lived JWT tokens, not a single shared secret.

  3. Event schema versioning: Agents may be updated independently. Include a schema version field and handle backward compatibility:

interface AgentEvent {
  schema_version: '1.0' | '1.1' | '2.0';
  // ... other fields
}

Webhook Fan-Out for Multi-Tenant Agent Platforms

Platforms hosting multiple AI agents (like BotsHub) need to route incoming webhooks to the correct agent:

// Fan-out: one incoming webhook → multiple agent handlers
async function fanOutWebhook(event: WebhookEvent): Promise<void> {
  // Find all agents subscribed to this event type
  const subscribers = await db.findSubscribersForEvent({
    tenantId: event.tenantId,
    eventType: event.type,
  });

  // Deliver to each subscriber independently
  // Failure for one subscriber doesn't affect others
  await Promise.allSettled(
    subscribers.map(subscriber =>
      deliveryQueue.enqueue({
        eventId: `${event.id}:${subscriber.agentId}`,
        agentId: subscriber.agentId,
        payload: event,
      })
    )
  );
}

Key principle: fan-out copies must have independent delivery tracking. If delivery to Agent A fails, it should not block or affect delivery to Agent B.

Observability for Agent Webhook Processing

Traditional webhook observability tracks delivery from sender to receiver. For agent platforms, you need to track the full chain:

External Event → Webhook Received → Agent Task Created →
Agent Processing Started → Tool Calls Made →
Agent Response Generated → Side Effects Applied

Each step should be traceable via a shared trace_id derived from the original webhook event ID. This enables debugging questions like: "Why did this payment webhook cause the agent to send 3 messages to the user?"
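
A minimal sketch of that chain, deriving the trace ID deterministically from the event ID so retried deliveries share one trace. The step names mirror the chain above; the in-memory log and `processWebhookEvent` stand in for a real telemetry sink and pipeline.

```typescript
interface TraceEvent {
  trace_id: string;
  step: string;
  at: string;
  detail?: Record<string, unknown>;
}

// Stand-in for a real telemetry sink (OpenTelemetry, structured logs, …)
const traceLog: TraceEvent[] = [];

function traceStep(
  traceId: string,
  step: string,
  detail?: Record<string, unknown>,
): void {
  traceLog.push({ trace_id: traceId, step, at: new Date().toISOString(), detail });
}

function processWebhookEvent(eventId: string): string {
  // Deterministic derivation: retries of the same event share a trace
  const traceId = `trace:${eventId}`;
  traceStep(traceId, 'webhook_received', { eventId });
  traceStep(traceId, 'agent_task_created');
  traceStep(traceId, 'agent_processing_started');
  traceStep(traceId, 'tool_call', { tool: 'send_message' });
  traceStep(traceId, 'agent_response_generated');
  traceStep(traceId, 'side_effects_applied');
  return traceId;
}
```

Querying the sink by trace_id then answers exactly the kind of question posed above.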


Practical Recommendations

For Webhook Producers (Building Webhook Systems)

  1. Adopt Standard Webhooks headers: Use webhook-id, webhook-timestamp, and webhook-signature. This gives consumers a standard library path for verification.

  2. Include sequence numbers and entity version numbers in payloads. This gives consumers the data they need to detect and handle out-of-order delivery.

  3. Implement multi-stage retry with exponential backoff and jitter: Immediate retry → short backoff → long backoff → DLQ. Use 7 attempts over 24 hours as a reasonable default.

  4. Support multiple active secrets for key rotation. Never require subscribers to update their secret during a maintenance window.

  5. Expose delivery logs via API. Subscribers debugging failures should not need to contact your support team.

  6. Use circuit breakers per endpoint, not per event. When an endpoint goes unhealthy, hold events in a buffer rather than dropping them.

  7. Provide a test/replay mechanism. A POST /webhooks/{event_id}/redeliver endpoint makes subscriber debugging dramatically easier.
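
The multi-stage schedule in recommendation 3 can be sketched as a base-delay table with full jitter. The specific delays below are illustrative defaults that sum to roughly 24 hours across 7 attempts; `nextRetryDelayMs` is not any library's API.

```typescript
const BASE_DELAYS_MS = [
  0,               // attempt 1: immediate retry
  60_000,          // attempt 2: 1 minute
  5 * 60_000,      // 5 minutes
  30 * 60_000,     // 30 minutes
  2 * 3_600_000,   // 2 hours
  8 * 3_600_000,   // 8 hours
  13 * 3_600_000,  // ~13 hours — total ≈ 24h
];

// attempt is zero-based; returns null when attempts are exhausted (→ DLQ)
function nextRetryDelayMs(attempt: number): number | null {
  if (attempt >= BASE_DELAYS_MS.length) return null;
  // Full jitter: pick uniformly in [0, baseDelay] to avoid retry stampedes
  return Math.floor(Math.random() * (BASE_DELAYS_MS[attempt] + 1));
}
```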

For Webhook Consumers (Building Receivers)

  1. Verify signatures before processing. Use constant-time comparison. Reject requests outside the 5-minute timestamp window.

  2. Acknowledge immediately, process asynchronously. Return 2xx within the provider's timeout window (usually 3–30s). Enqueue to a durable queue, then process in background workers.

  3. Store processed event IDs with a 24–72-hour TTL. Check this store before processing to handle at-least-once delivery correctly.

  4. Choose failure status codes intentionally. Returning 500 causes the sender to retry (often desirable for transient faults); returning 400 tells the sender the event is unprocessable and will land in the DLQ rather than being retried.

  5. Design for out-of-order delivery. Use entity version numbers to detect stale events. Implement buffering for out-of-order events when ordering matters.

  6. Monitor webhook processing latency and error rates. Alert when DLQ depth grows. Track time-to-process as a leading indicator of capacity issues.
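
The deduplication store in recommendation 3 can be sketched with an in-memory map and TTL. In production this would be a durable store such as Redis (SET key NX EX ttl); `ProcessedEventStore` and its method names are illustrative.

```typescript
class ProcessedEventStore {
  // eventId -> expiry timestamp (ms since epoch)
  private seen = new Map<string, number>();

  constructor(private ttlMs = 48 * 3_600_000) {} // 48h, within the 24–72h range

  // Returns true if this is the first (non-expired) sighting of the event;
  // false means duplicate — skip processing
  markProcessed(eventId: string, now = Date.now()): boolean {
    const expiry = this.seen.get(eventId);
    if (expiry !== undefined && expiry > now) return false;
    this.seen.set(eventId, now + this.ttlMs);
    return true;
  }

  // Periodic cleanup; Redis handles this automatically via EX
  evictExpired(now = Date.now()): void {
    for (const [id, expiry] of this.seen) {
      if (expiry <= now) this.seen.delete(id);
    }
  }
}
```

Call `markProcessed` before handling an event; only process when it returns true.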

For AI Agent Platforms Specifically

  1. Treat idempotency as a first-class concern at every layer: receiver idempotency, agent task idempotency, and downstream action idempotency.

  2. Use hybrid WebSocket + webhook architectures: webhooks for reliable delivery of discrete events from external systems, WebSockets for real-time bidirectional communication with UI clients and agent dashboards.

  3. Implement call-chain tracking for agent-to-agent webhooks to prevent infinite loops.

  4. Track the full processing chain with a shared trace ID from webhook receipt to side effects. This is essential for debugging autonomous agent behavior.

  5. Use fan-out with independent delivery tracking when routing events to multiple agents. One agent's failure should never block another's delivery.

  6. Apply rate limits both outbound and inbound. Protect your subscribers from burst traffic (outbound limiting) and protect your agents from being overwhelmed (inbound rate limiting with backpressure).

  7. Evaluate CloudEvents adoption if operating in cloud-native infrastructure. Azure Event Grid, Google Eventarc, and Knative all speak CloudEvents — the standard envelope eliminates custom parsing per provider.


Summary Reference Table

Concern                  Recommendation                          Key Tools/Standards
Delivery semantics       At-least-once + idempotent consumers    Redis SET NX, DB unique constraints
Retry strategy           Exponential backoff with full jitter    7 attempts over 24h
Circuit breaker          Per-endpoint, not per-event             CLOSED → OPEN → HALF_OPEN
Signature verification   HMAC-SHA256 + timestamp                 Standard Webhooks spec
Replay prevention        5-minute timestamp window               Constant-time comparison
Failed delivery          Dead letter queue with alerting         SQS DLQ, custom DLQ
Event ordering           Sequence numbers + entity version       Buffer + backfill pattern
Rate limiting            Token bucket per destination            429 + Retry-After handling
Real-time UI             WebSocket in addition to webhooks       Hybrid architecture
Standardization          Standard Webhooks + CloudEvents         standardwebhooks.com, cloudevents.io
Testing                  ngrok + Stripe CLI + mocked payloads    Jest, Wiremock
AI agents                Async acknowledgment + trace ID chain   Fan-out with independent tracking

References