SQLite WAL Mode: Patterns and Pitfalls for AI Agent Systems
Executive Summary
SQLite's Write-Ahead Logging (WAL) mode is the go-to choice for embedded databases in AI agent systems. It enables concurrent reads during writes, eliminates the reader/writer blocking of rollback journal mode, and requires zero operational overhead. But WAL introduces its own class of failures that catches production systems off guard: checkpoint starvation, unbounded WAL growth, and lock semantics that behave nothing like what most developers expect.
This article dissects WAL internals, catalogs the pitfalls we've encountered running SQLite as a message queue for AI agents, and presents actionable patterns for production hardening, monitoring, and recovery.
How WAL Mode Actually Works
In rollback journal mode, SQLite copies original page content to a journal before modifying the database file. WAL inverts this: the original database file is never modified during a transaction. New page versions are appended sequentially to a WAL file (db.sqlite-wal). A commit is simply a frame appended to the WAL with a commit marker — no fsync to the main database needed.
This gives three benefits:
- Readers never block writers (and vice versa): reads see the last committed WAL frame; writes append to the end
- Sequential writes are faster than the random seeks required by rollback mode
- Fewer fsync calls: in synchronous = NORMAL mode, fsync only occurs during checkpoints
The WAL Index (SHM File)
The -shm file is a memory-mapped hash index over the WAL. It helps readers locate the latest version of any page without scanning the entire WAL sequentially. Key properties:
- Organized in 32KB blocks with 4,096 page entries and 8,192 hash slots each
- Does not contain database content — it's a derived index, rebuilt automatically from the WAL after a crash
- Only valid while at least one process has the database open
- Safe to delete when no processes hold the database open
Lock States in WAL Mode
WAL mode uses a fundamentally different locking model than rollback mode. The traditional SHARED/RESERVED/PENDING/EXCLUSIVE chain is largely absent:
| Lock | Holder | Purpose |
|---|---|---|
| SQLITE_LOCK_SHARED | All open connections | Signals the DB is open |
| WAL_WRITE_LOCK | Active writer (exclusive) | Prevents concurrent writers |
| CHECKPOINTER lock | Checkpointing connection | Prevents concurrent checkpoints |
| SQLITE_LOCK_EXCLUSIVE | Last connection closing | Final checkpoint + WAL/SHM cleanup |
Critical fact: POSIX file locks are process-scoped. When a process crashes, all its locks are automatically released. A crashed process cannot leave a WAL lock permanently held. This eliminates an entire class of lock-related failures — unless you're on a network filesystem, where all bets are off.
Checkpointing: The Hidden Complexity
Checkpointing writes WAL frames back to the main database file, allowing the WAL to be recycled. There are four checkpoint modes:
| Mode | Blocks? | Use Case |
|---|---|---|
| PASSIVE | Never | Default auto-checkpoint; best effort, skips if anything is busy |
| FULL | Waits for writers | Complete checkpoint; readers can still use WAL |
| RESTART | Waits for all | Resets WAL; ensures future reads start from DB file |
| TRUNCATE | Waits for all | Like RESTART + physically shrinks WAL to 0 bytes |
The automatic checkpoint triggered by wal_autocheckpoint (default: 1000 pages) uses PASSIVE mode. This is the root cause of most WAL problems in production.
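All four modes can also be invoked manually through PRAGMA wal_checkpoint. A quick sketch with better-sqlite3, assuming an open Database handle named db (the busy/log/checkpointed columns are the standard wal_checkpoint result row):

// Best-effort checkpoint: copies what it can, never blocks
const passive = db.pragma('wal_checkpoint(PASSIVE)');
// Blocking checkpoint that also resets the WAL for future writers
const restart = db.pragma('wal_checkpoint(RESTART)');
// Both return [{ busy, log, checkpointed }]: busy=1 means blocked,
// log is the WAL length in frames, checkpointed is frames written back
console.log(passive[0], restart[0]);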
The Seven Pitfalls
1. Default busy_timeout Is Zero
Out of the box, SQLite returns SQLITE_BUSY immediately when it encounters any lock contention. In a service with any concurrency, this guarantees spurious failures.
Fix: Always set PRAGMA busy_timeout = 5000 (or higher) before any operations. For long-running agent processes, 5000-30000ms is standard.
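With better-sqlite3 this looks like:

const Database = require('better-sqlite3');
const db = new Database('agent.sqlite');
// Wait up to 5 seconds for a lock before surfacing SQLITE_BUSY
db.pragma('busy_timeout = 5000');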
2. SQLITE_BUSY_SNAPSHOT Is Not Retryable
When a transaction starts as a read and later attempts a write, SQLite must upgrade the transaction. If another connection has written since the read began, the upgrade fails with SQLITE_BUSY_SNAPSHOT. Unlike regular SQLITE_BUSY, the busy timeout does not help — the snapshot is stale and the entire transaction must be restarted.
Fix: Use BEGIN IMMEDIATE when you know a transaction will write. This acquires the write lock at transaction start, eliminating snapshot conflicts.
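In better-sqlite3, transaction functions expose this as the .immediate() variant. A minimal sketch (the messages table is illustrative):

const enqueue = db.transaction((agentId, payload) => {
  db.prepare('INSERT INTO messages (agent_id, payload) VALUES (?, ?)')
    .run(agentId, payload);
});
// .immediate() wraps the body in BEGIN IMMEDIATE, taking the write lock
// up front so the transaction cannot fail mid-way with SQLITE_BUSY_SNAPSHOT
enqueue.immediate('agent-42', JSON.stringify({ type: 'task' }));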
3. Checkpoint Starvation (The Silent Killer)
This is the most dangerous WAL pitfall for long-running processes:
- The PASSIVE auto-checkpoint can only reset the WAL if no active readers need those WAL pages
- In a busy service, even short reads that overlap with the checkpoint prevent it from completing
- The checkpoint silently does nothing. No error. No warning. No log.
- The WAL grows indefinitely. Read performance degrades proportionally to WAL size.
- Eventually: multi-GB WAL files, severe performance degradation, disk pressure
Fix: Monitor WAL file size. Schedule PRAGMA wal_checkpoint(RESTART) during idle periods. Consider disabling auto-checkpoint entirely (wal_autocheckpoint = 0) and managing checkpoints manually.
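A sketch of the manual-checkpoint approach; the interval length and the agentIsBusy() idle check are placeholders to adapt to your workload:

// Take checkpoint scheduling away from SQLite entirely
db.pragma('wal_autocheckpoint = 0');

setInterval(() => {
  if (agentIsBusy()) return; // hypothetical idle check; skip under load
  const [result] = db.pragma('wal_checkpoint(RESTART)');
  if (result.busy) {
    console.warn('Checkpoint blocked by readers; retrying next cycle');
  }
}, 60 * 1000).unref(); // don't let the timer keep the process alive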
4. busy_timeout Doesn't Apply to Checkpoints
Even with a generous busy timeout set, PRAGMA wal_checkpoint(RESTART) returns SQLITE_BUSY immediately if any reader is active. The timeout is ignored for checkpoint operations.
Fix: Implement your own retry loop with backoff for manual checkpoint operations.
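A sketch of such a loop, using a synchronous Atomics.wait sleep (the same trick as in the error-handling code later in this article):

function sleepSync(ms) {
  // Synchronous sleep; Atomics.wait is allowed on Node's main thread
  Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, ms);
}

function checkpointWithRetry(db, maxRetries = 5) {
  for (let i = 0; i < maxRetries; i++) {
    const [result] = db.pragma('wal_checkpoint(RESTART)');
    if (!result.busy) return result; // checkpoint completed
    sleepSync(Math.min(250 * 2 ** i, 5000)); // exponential backoff
  }
  throw new Error('wal_checkpoint(RESTART) still blocked after retries');
}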
5. Single-Writer Serialization Overhead
WAL allows exactly one writer at a time. Multiple Node.js connections all writing to the same database serialize anyway — they just add lock contention overhead without any throughput benefit.
Fix: Use a single connection for all writes (single-writer pattern). In Node.js with better-sqlite3, this is natural since the API is synchronous.
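A common shape for this pattern: one module owns the write connection and exports prepared-statement wrappers, so every importer shares the same writer (the table schema is illustrative):

// writer.js: the only module that writes to agent.sqlite
const Database = require('better-sqlite3');
const db = new Database('agent.sqlite');
db.pragma('journal_mode = WAL');
db.pragma('busy_timeout = 5000');

const insertMessage = db.prepare(
  'INSERT INTO messages (agent_id, payload) VALUES (?, ?)'
);

module.exports = {
  // All writes funnel through this single synchronous connection
  enqueue: (agentId, payload) => insertMessage.run(agentId, payload),
};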
6. Network Filesystem Incompatibility
WAL requires shared memory via the SHM file. Network filesystems (NFS, CIFS/SMB, FUSE over network) either lack POSIX locking semantics or implement them incorrectly. Results: SQLITE_IOERR, phantom locks that survive process death, database corruption.
Fix: Always use local storage for SQLite databases. For multi-machine access, use replication (Litestream, LiteFS, Turso) or migrate to PostgreSQL.
7. Async Node.js Libraries Add Complexity
The popular node-sqlite3 library uses libuv threads, making lock ordering harder to reason about. Each query may execute on a different thread, and "forgot to await" bugs can cause write-after-write conflicts.
Fix: Use better-sqlite3 (synchronous API). Its single-threaded execution model naturally prevents concurrency bugs and benchmarks 30-50% faster for typical workloads.
Production Hardening
Recommended PRAGMA Settings
const Database = require('better-sqlite3');
const db = new Database('agent.sqlite');
// Core: WAL mode (persistent, safe to repeat)
db.pragma('journal_mode = WAL');
// Core: Retry on contention instead of failing immediately
db.pragma('busy_timeout = 5000');
// Performance: fsync only at checkpoint, not per-commit
// Still crash-safe in WAL mode
db.pragma('synchronous = NORMAL');
// Performance: 32MB page cache (default is ~2MB)
db.pragma('cache_size = -32000');
// Performance: Memory-mapped I/O up to 128MB
db.pragma('mmap_size = 134217728');
// Performance: Temp tables in memory
db.pragma('temp_store = 2');
// Integrity: Enforce foreign keys
db.pragma('foreign_keys = ON');
On connection close or periodically:
db.pragma('analysis_limit = 400');
db.pragma('optimize');
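One way to wire up the close-time variant (the 'exit' hook is an assumption about your process lifecycle; it works here because better-sqlite3's pragma and close calls are synchronous):

process.on('exit', () => {
  db.pragma('analysis_limit = 400');
  db.pragma('optimize'); // refresh query-planner statistics cheaply
  db.close();
});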
WAL Health Monitoring
The most important metric is WAL file size. A growing WAL means checkpoints aren't completing:
const fs = require('fs');

function checkWalHealth(dbPath) {
const walPath = dbPath + '-wal';
try {
const stat = fs.statSync(walPath);
const sizeMB = stat.size / (1024 * 1024);
if (sizeMB > 50) {
console.warn(`WAL is ${sizeMB.toFixed(1)}MB — checkpoint starvation likely`);
return { healthy: false, sizeMB };
}
return { healthy: true, sizeMB };
} catch (e) {
// WAL doesn't exist = fully checkpointed = healthy
return { healthy: true, sizeMB: 0 };
}
}
Pair this with periodic manual checkpoints:
setInterval(() => {
try {
const result = db.pragma('wal_checkpoint(RESTART)');
// result: [{busy, log, checkpointed}]
if (result[0].busy > 0) {
console.log('Checkpoint blocked by active readers');
}
} catch (err) {
console.error('Checkpoint failed:', err.message);
}
}, 5 * 60 * 1000); // Every 5 minutes
SQLITE_BUSY Error Handling
Different BUSY errors require different responses:
function executeWrite(db, sql, params, maxRetries = 3) {
for (let i = 0; i <= maxRetries; i++) {
try {
return db.prepare(sql).run(params);
} catch (err) {
if (err.code === 'SQLITE_BUSY_SNAPSHOT') {
// Stale snapshot — MUST restart entire transaction
throw err;
}
if (err.code === 'SQLITE_BUSY_RECOVERY') {
// Another process recovering from crash — wait briefly
Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, 500);
continue;
}
if (err.code === 'SQLITE_BUSY' && i < maxRetries) {
// Write lock contention — exponential backoff
const delay = Math.min(100 * Math.pow(2, i), 5000);
Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, delay);
continue;
}
throw err;
}
  }
  // Unreachable unless every attempt hit SQLITE_BUSY_RECOVERY; fail loudly
  // instead of silently returning undefined
  throw new Error('executeWrite: retries exhausted');
}
Recovery: When Things Go Wrong
What "Stuck WAL Lock" Actually Means
When operators report a "stuck WAL lock," they usually mean one of three things:
- Checkpoint starvation: Not actually a lock issue. The WAL is growing because PASSIVE checkpoints keep failing. Fix: run a RESTART checkpoint during a quiet period.
- Hung process in D-state: A process stuck in uninterruptible sleep (Linux D state) still holds its file descriptors and POSIX locks. Detect with lsof /path/to/db.sqlite and cat /proc/<PID>/status | grep State. Fix: kill the hung process.
- Network filesystem phantom locks: Locks managed by the NFS lock daemon can survive process death. Fix: move to local storage.
Safe WAL Recovery Procedure
# Step 1: Verify no processes have the database open
lsof /path/to/db.sqlite 2>/dev/null
# Step 2: Run a TRUNCATE checkpoint
sqlite3 /path/to/db.sqlite "PRAGMA wal_checkpoint(TRUNCATE);"
# Step 3: Verify integrity
sqlite3 /path/to/db.sqlite "PRAGMA integrity_check;"
# Must return "ok"
# Step 4: If checkpoint succeeded, WAL/SHM are safe to delete
# (but usually unnecessary — checkpoint handles cleanup)
Never delete a WAL file that may still contain committed-but-uncheckpointed transactions: those commits exist only in the WAL, and deleting it loses them. Always checkpoint first.
Crash Recovery Is Automatic
When a service restarts after a crash, the first connection to open the database automatically runs WAL recovery: replaying committed transactions from the WAL to the main database. This typically takes less than a second. During recovery, other connections get SQLITE_BUSY_RECOVERY — a transient error that resolves when recovery completes.
The exception: if checkpoint starvation produced a multi-GB WAL before the crash, recovery can take minutes. This is another reason to monitor and control WAL growth proactively.
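A startup guard sketch: the first query against the database is what triggers recovery, so probe with a trivial query and retry while other processes are still reporting SQLITE_BUSY_RECOVERY (the retry count and delay are assumptions):

const Database = require('better-sqlite3');

function openWithRecovery(path, attempts = 20) {
  const db = new Database(path);
  for (let i = 0; i < attempts; i++) {
    try {
      db.prepare('SELECT 1').get(); // forces WAL recovery if one is pending
      return db;
    } catch (err) {
      if (err.code !== 'SQLITE_BUSY_RECOVERY') throw err;
      // Synchronous 250ms pause before re-probing
      Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, 250);
    }
  }
  throw new Error('Database still in recovery after repeated attempts');
}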
The Production Ecosystem
Litestream: Streaming WAL Replication
Litestream replicates WAL frames to S3-compatible storage in near-real-time. It works by holding a long-running read transaction (preventing other checkpoints), copying WAL pages to sequentially numbered "shadow WAL" files, then managing checkpoints itself.
Its three-tier checkpoint strategy is instructive:
- Non-blocking (PASSIVE): frequent, at ~4MB WAL size
- Time-based: periodic cleanup even when WAL is small
- Emergency truncation: at ~500MB, blocks everything to prevent unbounded growth
Turso/libSQL: SQLite Over the Network
Turso's embedded replicas pattern is ideal for agent systems: the full database lives locally for fast reads, syncing to a remote server for durability. The "database per agent" model eliminates all cross-agent lock contention — each agent operates on its own isolated SQLite file.
Cloudflare D1: SQLite at the Edge
D1 runs SQLite in WAL mode inside Durable Objects, synchronously replicating WAL frames to 5 datacenters. Global read replication (sequential consistency) was added in 2025, making it viable for geographically distributed agent deployments.
When to Stay vs. When to Migrate
SQLite Is Sufficient When:
- Single agent process per database file
- Write rate under ~5,000 transactions/second
- Read-heavy workloads (agent message queues are typically read-heavy)
- Local filesystem access (not NFS)
- Single-machine deployment
- You need zero-dependency embeddability
Consider PostgreSQL When:
- Multiple concurrent writers are needed (PG has row-level locking)
- You need LISTEN/NOTIFY for real-time event distribution between agents
- Cross-agent queries with JOINs are frequent
- You need SKIP LOCKED for work queue patterns
- Multi-machine write access is required
- Write QPS exceeds ~5,000/sec sustained
Latest Developments (2025-2026)
SQLite 3.50.0 (May 2025, 25th anniversary release) introduced sqlite3_setlk_timeout() — a separate timeout specifically for blocking locks, distinct from busy_timeout. This gives production services more precise control over lock wait behavior.
WAL2 / BEGIN CONCURRENT remains on experimental branches only. This feature would enable optimistic multi-writer concurrency, but it has not been merged to the main SQLite trunk and has no announced release timeline. Do not architect production systems around it shipping.
Node.js ecosystem: better-sqlite3 remains the consensus choice for production Node.js. The experimental node:sqlite built-in (Node.js 22.5+) is not yet production-ready. Bun's bun:sqlite claims 3-6x performance gains, but benchmarks are contested.
Lessons from the Field
Running SQLite as a message queue for AI agents, we learned:
- Monitor WAL size, not just database size. A 5MB database with a 500MB WAL is a ticking time bomb.
- Set busy_timeout before anything else. The default of zero makes every concurrent operation a coin flip.
- Use BEGIN IMMEDIATE for writes. SQLITE_BUSY_SNAPSHOT errors are the most confusing production failures because they bypass busy_timeout and require full transaction restarts.
- Schedule periodic RESTART checkpoints. Don't rely on the PASSIVE auto-checkpoint for long-running processes.
- One database per agent is the simplest path to eliminating lock contention in multi-agent systems.
- better-sqlite3's synchronous API is a feature. It makes lock ordering trivial and eliminates "forgot to await" concurrency bugs.
SQLite is a remarkable piece of engineering — 25 years old and still gaining adoption. But like any tool, its failure modes are specific and learnable. Understanding WAL internals transforms SQLite from "that embedded database" into a reliable foundation for production agent systems.