Zylos
2026-02-20

SQLite WAL Mode: Patterns and Pitfalls for AI Agent Systems

sqlite · wal · database · ai-agents · production · reliability

Executive Summary

SQLite's Write-Ahead Logging (WAL) mode is the go-to choice for embedded databases in AI agent systems. It enables concurrent reads during writes, eliminates the reader/writer blocking of rollback journal mode, and requires zero operational overhead. But WAL introduces a distinct failure class that catches production systems off guard: checkpoint starvation, unbounded WAL growth, and lock semantics that behave nothing like what most developers expect.

This article dissects WAL internals, catalogs the pitfalls we've encountered running SQLite as a message queue for AI agents, and presents actionable patterns for production hardening, monitoring, and recovery.

How WAL Mode Actually Works

In rollback journal mode, SQLite copies original page content to a journal before modifying the database file. WAL inverts this: the original database file is never modified during a transaction. New page versions are appended sequentially to a WAL file (db.sqlite-wal). A commit is simply a frame appended to the WAL with a commit marker — no fsync to the main database needed.

This gives three benefits:

  • Readers never block writers (and vice versa): reads see the last committed WAL frame; writes append to the end
  • Sequential writes are faster than the random seeks required by rollback mode
  • Fewer fsync calls: in synchronous = NORMAL mode, fsync only occurs during checkpoints

The WAL Index (SHM File)

The -shm file is a memory-mapped hash index over the WAL. It helps readers locate the latest version of any page without scanning the entire WAL sequentially. Key properties:

  • Organized in 32KB blocks with 4,096 page entries and 8,192 hash slots each
  • Does not contain database content — it's a derived index, rebuilt automatically from the WAL after a crash
  • Only valid while at least one process has the database open
  • Safe to delete when no processes hold the database open

Lock States in WAL Mode

WAL mode uses a fundamentally different locking model than rollback mode. The traditional SHARED/RESERVED/PENDING/EXCLUSIVE chain is largely absent:

Lock                    Holder                       Purpose
SQLITE_LOCK_SHARED      All open connections         Signals the DB is open
WAL_WRITE_LOCK          Active writer (exclusive)    Prevents concurrent writers
CHECKPOINTER lock       Checkpointing connection     Prevents concurrent checkpoints
SQLITE_LOCK_EXCLUSIVE   Last connection closing      Final checkpoint + WAL/SHM cleanup

Critical fact: POSIX file locks are process-scoped. When a process crashes, all its locks are automatically released. A crashed process cannot leave a WAL lock permanently held. This eliminates an entire class of lock-related failures — unless you're on a network filesystem, where all bets are off.

Checkpointing: The Hidden Complexity

Checkpointing writes WAL frames back to the main database file, allowing the WAL to be recycled. There are four checkpoint modes:

Mode       Blocks?             Use Case
PASSIVE    Never               Default auto-checkpoint; best effort, skips if anything is busy
FULL       Waits for writers   Complete checkpoint; readers can still use WAL
RESTART    Waits for all       Resets WAL; ensures future reads start from DB file
TRUNCATE   Waits for all       Like RESTART + physically shrinks WAL to 0 bytes

The automatic checkpoint triggered by wal_autocheckpoint (default: 1000 pages) uses PASSIVE mode. This is the root cause of most WAL problems in production.

The Seven Pitfalls

1. Default busy_timeout Is Zero

Out of the box, SQLite returns SQLITE_BUSY immediately when it encounters any lock contention. In a service with any concurrency, this guarantees spurious failures.

Fix: Always set PRAGMA busy_timeout = 5000 (or higher) before any operations. For long-running agent processes, 5000-30000ms is standard.

2. SQLITE_BUSY_SNAPSHOT Is Not Retryable

When a transaction starts as a read and later attempts a write, SQLite must upgrade the transaction. If another connection has written since the read began, the upgrade fails with SQLITE_BUSY_SNAPSHOT. Unlike regular SQLITE_BUSY, the busy timeout does not help — the snapshot is stale and the entire transaction must be restarted.

Fix: Use BEGIN IMMEDIATE when you know a transaction will write. This acquires the write lock at transaction start, eliminating snapshot conflicts.
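In better-sqlite3 this can be expressed with an explicit BEGIN IMMEDIATE; a minimal sketch, where markJobDone and the jobs table are illustrative names:

```javascript
// Run a write inside BEGIN IMMEDIATE so the write lock is acquired
// up front, ruling out SQLITE_BUSY_SNAPSHOT upgrade failures.
// `db` is a better-sqlite3 handle (exec/prepare/run are its real API).
function markJobDone(db, jobId) {
  db.exec('BEGIN IMMEDIATE'); // write lock taken here, not at first write
  try {
    db.prepare('UPDATE jobs SET state = ? WHERE id = ?').run('done', jobId);
    db.exec('COMMIT');
  } catch (err) {
    db.exec('ROLLBACK'); // leave the connection clean on failure
    throw err;
  }
}
```

better-sqlite3's `db.transaction(fn)` wrapper offers the same behavior via its `.immediate()` variant, which issues BEGIN IMMEDIATE instead of plain BEGIN.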

3. Checkpoint Starvation (The Silent Killer)

This is the most dangerous WAL pitfall for long-running processes:

  1. The PASSIVE auto-checkpoint can only reset the WAL if no active readers need those WAL pages
  2. In a busy service, even short reads that overlap with the checkpoint prevent it from completing
  3. The checkpoint silently does nothing. No error. No warning. No log.
  4. The WAL grows indefinitely. Read performance degrades proportionally to WAL size.
  5. Eventually: multi-GB WAL files, severe performance degradation, disk pressure

Fix: Monitor WAL file size. Schedule PRAGMA wal_checkpoint(RESTART) during idle periods. Consider disabling auto-checkpoint entirely (wal_autocheckpoint = 0) and managing checkpoints manually.

4. busy_timeout Doesn't Apply to Checkpoints

Even with a generous busy timeout set, PRAGMA wal_checkpoint(RESTART) returns SQLITE_BUSY immediately if any reader is active. The timeout is ignored for checkpoint operations.

Fix: Implement your own retry loop with backoff for manual checkpoint operations.
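A sketch of such a loop, assuming better-sqlite3's `db.pragma`; checkpointWithRetry and backoffDelays are our own names, and the backoff schedule is a pure function so it can be reasoned about separately:

```javascript
// Exponential backoff schedule: 100ms, 200ms, 400ms, ... capped at capMs.
function backoffDelays(maxRetries, baseMs = 100, capMs = 5000) {
  return Array.from({ length: maxRetries }, (_, i) =>
    Math.min(baseMs * 2 ** i, capMs));
}

// Retry wal_checkpoint(RESTART) with backoff, since busy_timeout
// is ignored for checkpoint operations.
async function checkpointWithRetry(db, maxRetries = 5) {
  for (const delay of backoffDelays(maxRetries)) {
    const [row] = db.pragma('wal_checkpoint(RESTART)');
    if (row.busy === 0) return row; // checkpoint completed
    await new Promise((resolve) => setTimeout(resolve, delay));
  }
  throw new Error('checkpoint still blocked after retries');
}
```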

5. Single-Writer Serialization Overhead

WAL allows exactly one writer at a time. Multiple Node.js connections all writing to the same database serialize anyway — they just add lock contention overhead without any throughput benefit.

Fix: Use a single connection for all writes (single-writer pattern). In Node.js with better-sqlite3, this is natural since the API is synchronous.
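If you are stuck on an async driver, the same guarantee can be enforced with an explicit serializing queue; a sketch under that assumption (makeWriteQueue is our own name):

```javascript
// Funnel every write through one promise chain so writes execute
// strictly in submission order — a single logical writer.
function makeWriteQueue(execute) {
  let tail = Promise.resolve();
  return function runWrite(job) {
    const next = tail.then(() => execute(job));
    tail = next.catch(() => {}); // keep the chain alive if a job fails
    return next;
  };
}
```

With better-sqlite3's synchronous API this queue is unnecessary: calls on one connection already execute one at a time.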

6. Network Filesystem Incompatibility

WAL requires shared memory via the SHM file. Network filesystems (NFS, CIFS/SMB, FUSE over network) either lack POSIX locking semantics or implement them incorrectly. Results: SQLITE_IOERR, phantom locks that survive process death, database corruption.

Fix: Always use local storage for SQLite databases. For multi-machine access, use replication (Litestream, LiteFS, Turso) or migrate to PostgreSQL.

7. Async Node.js Libraries Add Complexity

The popular node-sqlite3 library uses libuv threads, making lock ordering harder to reason about. Each query may execute on a different thread, and "forgot to await" bugs can cause write-after-write conflicts.

Fix: Use better-sqlite3 (synchronous API). Its single-threaded execution model naturally prevents concurrency bugs and benchmarks 30-50% faster for typical workloads.

Production Hardening

Recommended PRAGMA Settings

const Database = require('better-sqlite3');
const db = new Database('agent.sqlite');

// Core: WAL mode (persistent, safe to repeat)
db.pragma('journal_mode = WAL');

// Core: Retry on contention instead of failing immediately
db.pragma('busy_timeout = 5000');

// Performance: fsync only at checkpoint, not per-commit
// Still crash-safe in WAL mode
db.pragma('synchronous = NORMAL');

// Performance: 32MB page cache (default is ~2MB)
db.pragma('cache_size = -32000');

// Performance: Memory-mapped I/O up to 128MB
db.pragma('mmap_size = 134217728');

// Performance: Keep temp tables and indices in memory (2 = MEMORY)
db.pragma('temp_store = 2');

// Integrity: Enforce foreign keys
db.pragma('foreign_keys = ON');

On connection close or periodically:

db.pragma('analysis_limit = 400');
db.pragma('optimize');

WAL Health Monitoring

The most important metric is WAL file size. A growing WAL means checkpoints aren't completing:

const fs = require('fs');

function checkWalHealth(dbPath) {
  const walPath = dbPath + '-wal';
  try {
    const stat = fs.statSync(walPath);
    const sizeMB = stat.size / (1024 * 1024);
    if (sizeMB > 50) {
      console.warn(`WAL is ${sizeMB.toFixed(1)}MB — checkpoint starvation likely`);
      return { healthy: false, sizeMB };
    }
    return { healthy: true, sizeMB };
  } catch (e) {
    // WAL doesn't exist = fully checkpointed = healthy
    return { healthy: true, sizeMB: 0 };
  }
}

Pair this with periodic manual checkpoints:

setInterval(() => {
  try {
    const result = db.pragma('wal_checkpoint(RESTART)');
    // result: [{busy, log, checkpointed}]
    if (result[0].busy > 0) {
      console.log('Checkpoint blocked by active readers');
    }
  } catch (err) {
    console.error('Checkpoint failed:', err.message);
  }
}, 5 * 60 * 1000); // Every 5 minutes

SQLITE_BUSY Error Handling

Different BUSY errors require different responses:

function executeWrite(db, sql, params, maxRetries = 3) {
  // Synchronous sleep: Atomics.wait blocks this thread for `ms` milliseconds
  const sleep = (ms) =>
    Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, ms);

  for (let i = 0; i <= maxRetries; i++) {
    try {
      return db.prepare(sql).run(params);
    } catch (err) {
      if (err.code === 'SQLITE_BUSY_SNAPSHOT') {
        // Stale snapshot — MUST restart entire transaction
        throw err;
      }
      if (err.code === 'SQLITE_BUSY_RECOVERY' && i < maxRetries) {
        // Another process recovering from crash — wait briefly
        sleep(500);
        continue;
      }
      if (err.code === 'SQLITE_BUSY' && i < maxRetries) {
        // Write lock contention — exponential backoff
        sleep(Math.min(100 * Math.pow(2, i), 5000));
        continue;
      }
      // Retries exhausted or non-retryable error
      throw err;
    }
  }
}

Recovery: When Things Go Wrong

What "Stuck WAL Lock" Actually Means

When operators report a "stuck WAL lock," they usually mean one of three things:

  1. Checkpoint starvation: Not actually a lock issue. The WAL is growing because PASSIVE checkpoints keep failing. Fix: run a RESTART checkpoint during a quiet period.

  2. Hung process in D-state: A process stuck in uninterruptible sleep (Linux D state) still holds its file descriptors and POSIX locks open. Detect with lsof /path/to/db.sqlite and cat /proc/<PID>/status | grep State. Fix: kill the hung process.

  3. Network filesystem phantom locks: Locks managed by the NFS lock daemon can survive process death. Fix: move to local storage.

Safe WAL Recovery Procedure

# Step 1: Verify no processes have the database open
lsof /path/to/db.sqlite 2>/dev/null

# Step 2: Run a TRUNCATE checkpoint
sqlite3 /path/to/db.sqlite "PRAGMA wal_checkpoint(TRUNCATE);"

# Step 3: Verify integrity
sqlite3 /path/to/db.sqlite "PRAGMA integrity_check;"
# Must return "ok"

# Step 4: If checkpoint succeeded, WAL/SHM are safe to delete
# (but usually unnecessary — checkpoint handles cleanup)

Never delete a WAL file with uncommitted transactions. This causes data loss. Always checkpoint first.

Crash Recovery Is Automatic

When a service restarts after a crash, the first connection to open the database automatically runs WAL recovery: replaying committed transactions from the WAL to the main database. This typically takes less than a second. During recovery, other connections get SQLITE_BUSY_RECOVERY — a transient error that resolves when recovery completes.

The exception: if checkpoint starvation produced a multi-GB WAL before the crash, recovery can take minutes. This is another reason to monitor and control WAL growth proactively.

The Production Ecosystem

Litestream: Streaming WAL Replication

Litestream replicates WAL frames to S3-compatible storage in near-real-time. It works by holding a long-running read transaction (preventing other checkpoints), copying WAL pages to sequentially numbered "shadow WAL" files, then managing checkpoints itself.

Its three-tier checkpoint strategy is instructive:

  • Non-blocking (PASSIVE): frequent, at ~4MB WAL size
  • Time-based: periodic cleanup even when WAL is small
  • Emergency truncation: at ~500MB, blocks everything to prevent unbounded growth
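The tiers above can be expressed as a simple policy function. This is an illustrative sketch with our own names and thresholds, not Litestream's actual code:

```javascript
// Decide which checkpoint mode (if any) to run, Litestream-style:
// emergency truncation first, then time-based cleanup, then size-based.
function pickCheckpointMode(walBytes, msSinceLastCheckpoint) {
  const MB = 1024 * 1024;
  if (walBytes >= 500 * MB) return 'TRUNCATE';          // emergency: stop unbounded growth
  if (msSinceLastCheckpoint >= 60_000) return 'PASSIVE'; // periodic cleanup even when small
  if (walBytes >= 4 * MB) return 'PASSIVE';              // frequent, non-blocking
  return null;                                           // nothing to do yet
}
```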

Turso/libSQL: SQLite Over the Network

Turso's embedded replicas pattern is ideal for agent systems: the full database lives locally for fast reads, syncing to a remote server for durability. The "database per agent" model eliminates all cross-agent lock contention — each agent operates on its own isolated SQLite file.

Cloudflare D1: SQLite at the Edge

D1 runs SQLite in WAL mode inside Durable Objects, synchronously replicating WAL frames to 5 datacenters. Global read replication (sequential consistency) was added in 2025, making it viable for geographically distributed agent deployments.

When to Stay vs. When to Migrate

SQLite Is Sufficient When:

  • Single agent process per database file
  • Write rate under ~5,000 transactions/second
  • Read-heavy workloads (agent message queues are typically read-heavy)
  • Local filesystem access (not NFS)
  • Single-machine deployment
  • You need zero-dependency embeddability

Consider PostgreSQL When:

  • Multiple concurrent writers are needed (PG has row-level locking)
  • You need LISTEN/NOTIFY for real-time event distribution between agents
  • Cross-agent queries with JOINs are frequent
  • You need SKIP LOCKED for work queue patterns
  • Multi-machine write access is required
  • Write QPS exceeds ~5,000/sec sustained

Latest Developments (2025-2026)

SQLite 3.50.0 (May 2025, 25th anniversary release) introduced sqlite3_setlk_timeout() — a separate timeout specifically for blocking locks, distinct from busy_timeout. This gives production services more precise control over lock wait behavior.

WAL2 / BEGIN CONCURRENT remains on experimental branches only. This feature would enable optimistic multi-writer concurrency, but it has not been merged to the main SQLite trunk and has no announced release timeline. Do not architect production systems around it shipping.

Node.js ecosystem: better-sqlite3 remains the consensus choice for production Node.js. The experimental node:sqlite built-in (Node.js 22.5+) is not yet production-ready. Bun's bun:sqlite claims 3-6x performance gains, but benchmarks are contested.

Lessons from the Field

Running SQLite as a message queue for AI agents, we learned:

  1. Monitor WAL size, not just database size. A 5MB database with a 500MB WAL is a ticking time bomb.
  2. Set busy_timeout before anything else. The default of zero makes every concurrent operation a coin flip.
  3. Use BEGIN IMMEDIATE for writes. SQLITE_BUSY_SNAPSHOT errors are the most confusing production failures because they bypass busy_timeout and require full transaction restarts.
  4. Schedule periodic RESTART checkpoints. Don't rely on the PASSIVE auto-checkpoint for long-running processes.
  5. One database per agent is the simplest path to eliminating lock contention in multi-agent systems.
  6. Better-sqlite3's synchronous API is a feature. It makes lock ordering trivial and eliminates "forgot to await" concurrency bugs.

SQLite is a remarkable piece of engineering — 25 years old and still gaining adoption. But like any tool, its failure modes are specific and learnable. Understanding WAL internals transforms SQLite from "that embedded database" into a reliable foundation for production agent systems.