Zylos
2026-02-13

Database Replication: Strategies, Patterns, and Best Practices for High Availability

Tags: research, database, replication, high-availability, postgresql, mysql, distributed-systems

Executive Summary

Database replication is a foundational technique for achieving high availability, load distribution, and disaster recovery in modern systems. This research explores the major replication strategies—master-slave and multi-master—their trade-offs, implementation patterns for PostgreSQL and MySQL, conflict resolution approaches, and monitoring best practices as of 2026.

Key Insights:

  • Master-slave remains the recommended default for most workloads due to simplicity and reliability
  • Multi-master should be used sparingly, only when truly needed for specific edge cases
  • PostgreSQL 16's logical replication improvements make active-active more accessible
  • Conflict-Free Replicated Data Types (CRDTs) are emerging as a powerful solution for conflict resolution
  • Modern monitoring tools leverage OpenTelemetry for cross-database observability

Replication Strategies

Master-Slave (Primary-Replica) Replication

Master-slave replication establishes a clear hierarchy: one primary database (the master) processes all write operations, while multiple replicas (slaves) synchronize with it to serve read operations.

Architecture:

  • One master node handles all writes
  • One or more slave nodes are read-only
  • Slaves replicate changes from master asynchronously or synchronously
  • Changes flow unidirectionally: master → slaves

Advantages:

  • Read scalability: Distribute read load across multiple replicas for APIs, dashboards, and reports
  • Geographic distribution: Lower latency by routing users to nearby replicas
  • Simplicity: Clear write path eliminates conflict resolution complexity
  • Battle-tested: Mature implementations with extensive tooling

Limitations:

  • Single point of failure: If master goes down, writes stop until failover
  • Write bottleneck: All writes limited to single master node
  • Replication lag: Slaves may lag behind master, causing eventual consistency issues

Best Use Cases:

  • Read-heavy workloads (80%+ reads)
  • Applications tolerating eventual consistency
  • Cost-effective horizontal scaling for reads
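
The read-scaling pattern above is usually implemented by routing at the application or proxy layer: writes go to the primary, reads rotate across replicas. A minimal Python sketch, with plain strings standing in for real connections (the names `primary-db`, `replica-1`, and `replica-2` are hypothetical):

```python
import itertools

class ReplicationRouter:
    """Routes writes to the primary and load-balances reads across replicas.

    Sketch only: connection handles are represented by strings; a real
    router would hold database connections or DSNs.
    """

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)  # round-robin read balancing

    def route(self, statement):
        # Reads rotate across replicas; writes (and anything ambiguous) go
        # to the primary. Production routers also pin reads that follow a
        # write to the primary to hide replication lag ("read-your-writes").
        is_read = statement.lstrip().upper().startswith("SELECT")
        return next(self._replicas) if is_read else self.primary

router = ReplicationRouter("primary-db", ["replica-1", "replica-2"])
targets = [router.route(q) for q in
           ["SELECT * FROM users", "INSERT INTO users VALUES (1)", "SELECT 1"]]
```

Because of replication lag, a router this simple can serve stale reads immediately after a write; session pinning or lag-aware routing closes that gap.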

Multi-Master Replication

In multi-master replication, any node can accept both reads and writes. There's no single write leader—each primary records local changes and ships them to peers.

Architecture:

  • Multiple nodes accept writes simultaneously
  • Bidirectional replication between all nodes
  • Each node acts as both primary and replica
  • Requires conflict resolution mechanisms

Advantages:

  • Higher fault tolerance: No single point of failure for writes
  • Lower write latency: Users write to geographically nearest node (EU → EU, APAC → APAC)
  • True active-active: All nodes fully operational simultaneously

Challenges:

  • Conflict complexity: Concurrent writes to same data require resolution
  • Data consistency: Harder to maintain strong consistency guarantees
  • Operational overhead: More complex monitoring and troubleshooting
  • Potential data loss: Conflict resolution may discard updates

Best Use Cases:

  • Multi-region deployments with write requirements in each region
  • Edge computing with intermittent connectivity (retail stores, ships, military)
  • Extreme high availability requirements where downtime is unacceptable
  • Applications designed for eventual consistency

2026 Recommendation: Avoid multi-master unless truly needed—use master-slave for simplicity. Only adopt multi-master when geographic write latency or extreme HA justifies the operational complexity.

Database-Specific Implementations

PostgreSQL Replication

PostgreSQL offers two primary replication methods:

Physical Replication (Streaming Replication):

  • Uses Write-Ahead Log (WAL) files
  • Byte-level replication of entire database cluster
  • Faster, more reliable, and easier to manage than logical replication
  • Default method for standby servers

Logical Replication (Publish/Subscribe):

  • Replicates at the table/row level using logical decoding
  • Allows selective replication of specific tables
  • Supports different PostgreSQL versions and partial replication
  • PostgreSQL 16 improvements enable practical active-active setups

Best Practices:

  1. WAL Archiving: Not mandatory, but critical for robust replication; it lets a standby fetch WAL segments from the archive after the master has already recycled them
  2. Synchronous vs. Asynchronous: Default is asynchronous (performance); use synchronous when an acknowledged commit must survive loss of the primary
  3. Monitoring Replication Slots: Failed replicas that don't consume WALs can fill disk—clean up orphaned slots regularly
  4. Use pg_stat_replication: Built-in view for monitoring replication status and lag
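
pg_stat_replication reports positions as LSNs (log sequence numbers) such as `16/B374D848`; the lag between two LSNs in bytes is what `pg_wal_lsn_diff()` computes server-side. As a sketch of the arithmetic involved, assuming the standard LSN encoding (high 32 bits / low 32 bits, both hex):

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '16/B374D848' to an absolute byte
    position: the part before the slash is the high 32 bits, the part
    after is the low 32 bits, both hexadecimal."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def replication_lag_bytes(sent_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL the standby still has to replay -- the same quantity
    pg_wal_lsn_diff(sent_lsn, replay_lsn) returns on the server."""
    return lsn_to_bytes(sent_lsn) - lsn_to_bytes(replay_lsn)

lag = replication_lag_bytes("16/B374D848", "16/B374A000")
```

In practice you would run the diff in SQL against `pg_stat_replication` rather than client-side; the point is that byte lag, unlike time-based lag, is meaningful even when the primary is idle.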

Tools:

  • pgAdmin for visual monitoring
  • pgactive extension (AWS RDS) for active-active replication
  • pg_chameleon for cross-database replication (MySQL → PostgreSQL)

MySQL Replication

MySQL uses logical replication via binary logs (binlog):

Architecture:

  • Primary logs changes to binlog
  • Replicas read binlog and apply changes via SQL
  • Supports primary-to-single and primary-to-multiple-replica configurations
  • Asynchronous by default, semi-synchronous available

Key Configurations:

  • Asynchronous Replication: Primary doesn't wait for replica acknowledgment (fastest, but potential data loss)
  • Semi-Synchronous Replication: Primary waits for at least one replica to acknowledge before committing
  • Group Replication: Multi-primary mode with built-in conflict detection

Best Practices:

  1. Binary Log Format: Use ROW format for consistent replication (vs. STATEMENT, which can produce different results on replicas when statements are non-deterministic)
  2. GTIDs (Global Transaction Identifiers): Enable for easier failover and topology changes
  3. Monitoring Seconds_Behind_Source: Key metric for replication lag in MySQL
  4. Delayed Replication: Intentionally lag replicas for protection against accidental deletes
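
GTIDs make "is this replica caught up?" a set-containment question: a replica is promotable only when its executed GTID set contains everything the source has executed. A simplified Python sketch of that check (real GTID sets use full server UUIDs and may span several comma-separated sources; the short `3e11fa47` identifier below is illustrative):

```python
def parse_gtid_set(gtid_set: str):
    """Parse a GTID set like 'uuid:1-5:7' into {uuid: set_of_txn_ids}.
    Simplified: ignores tagged GTIDs and other newer syntax."""
    out = {}
    for part in gtid_set.split(","):
        uuid, *ranges = part.strip().split(":")
        txns = out.setdefault(uuid, set())
        for r in ranges:
            lo, _, hi = r.partition("-")
            txns.update(range(int(lo), int(hi or lo) + 1))
    return out

def replica_caught_up(source_executed: str, replica_executed: str) -> bool:
    """True when the replica has executed every transaction the source
    has -- the condition failover tooling checks before promotion."""
    src = parse_gtid_set(source_executed)
    rep = parse_gtid_set(replica_executed)
    return all(txns <= rep.get(uuid, set()) for uuid, txns in src.items())

caught_up = replica_caught_up("3e11fa47:1-100", "3e11fa47:1-100")
behind = replica_caught_up("3e11fa47:1-100", "3e11fa47:1-97")
```

Tools like MySQL Shell's admin API perform this comparison (via `GTID_SUBSET()` on the server) automatically during failover.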

MySQL Cluster (NDB):

  • Advanced conflict detection and resolution for active-active setups
  • Timestamp-based and application-defined conflict handlers
  • Higher operational complexity

Conflict Resolution and Eventual Consistency

Understanding Eventual Consistency

In distributed systems, eventual consistency guarantees that, given enough time without new updates, all replicas will converge to the same state. This trades immediate consistency for availability and partition tolerance (CAP theorem).

Types of Conflicts

Write-Write Conflicts:

  • Two or more nodes update same data concurrently
  • Most common in multi-master scenarios
  • Requires automated or manual resolution

Read-Write Conflicts:

  • Read occurs simultaneously with write on different nodes
  • Can result in reading stale data
  • Acceptable in many applications

Conflict Resolution Strategies

1. Last-Write-Wins (LWW)

  • Simplest approach: retain update with most recent timestamp
  • Pros: Easy to implement, deterministic
  • Cons: May lose valuable data, timestamp synchronization required
  • Use case: When all writes are equally important (or unimportant)
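
LWW reduces to a deterministic `max` over (timestamp, tiebreaker) pairs. A minimal sketch; the `node_id` tiebreaker is an assumption added here so that equal timestamps resolve identically on every replica:

```python
from dataclasses import dataclass

@dataclass
class VersionedValue:
    value: str
    timestamp: float   # wall-clock write time; LWW assumes clocks are synced
    node_id: str       # tiebreaker so equal timestamps resolve the same everywhere

def lww_merge(a: VersionedValue, b: VersionedValue) -> VersionedValue:
    """Last-write-wins: keep the update with the newer timestamp, breaking
    ties deterministically by node id so every replica picks the same winner."""
    return max(a, b, key=lambda v: (v.timestamp, v.node_id))

winner = lww_merge(VersionedValue("draft", 100.0, "node-a"),
                   VersionedValue("final", 105.0, "node-b"))
```

The "may lose valuable data" con is visible here: the losing write (`"draft"`) is silently discarded, and clock skew between nodes can make the wrong write win.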

2. Version Vectors and Clocks

  • Track causality between updates using vector clocks
  • Detect concurrent updates that need resolution
  • Pros: Accurately identifies conflicts
  • Cons: Doesn't resolve conflicts automatically—requires additional strategy
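
The core vector-clock operation is a pairwise comparison: if one clock dominates the other in every component, the updates are causally ordered; otherwise they are concurrent and need resolution. A compact sketch:

```python
def compare(vc_a: dict, vc_b: dict) -> str:
    """Compare two vector clocks (node -> counter). Returns 'before',
    'after', 'equal', or 'concurrent'; concurrent updates are exactly
    the ones that need a separate resolution strategy."""
    keys = vc_a.keys() | vc_b.keys()
    a_le_b = all(vc_a.get(k, 0) <= vc_b.get(k, 0) for k in keys)
    b_le_a = all(vc_b.get(k, 0) <= vc_a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a causally precedes b
    if b_le_a:
        return "after"       # b causally precedes a
    return "concurrent"      # neither dominates: a true conflict

rel = compare({"node-a": 2, "node-b": 1}, {"node-a": 1, "node-b": 2})
```

Note that the function only *detects* the conflict (the `"concurrent"` case); deciding which value survives still requires LWW, CRDTs, or application logic, as the con above states.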

3. Conflict-Free Replicated Data Types (CRDTs)

  • Data structures designed to converge without coordination
  • Commutative operations ensure same result regardless of order
  • Types: Counters (G-Counter, PN-Counter), Sets (OR-Set), Maps, Registers
  • Pros: Automatic resolution, guaranteed convergence, no coordination needed
  • Cons: Limited data types, memory overhead, not suitable for all use cases
  • 2026 Status: Gaining traction for collaborative applications, distributed caches, and real-time systems
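
The simplest CRDT, the G-Counter, illustrates the convergence guarantee: each node increments only its own slot, and merge takes the per-node maximum, making merges commutative, associative, and idempotent. A self-contained sketch:

```python
class GCounter:
    """Grow-only counter CRDT. Because merge is a per-node max, replicas
    converge to the same value no matter how often or in what order
    their states are exchanged."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {}

    def increment(self, n: int = 1):
        # Each node only ever writes its own slot, so there is nothing
        # for two nodes to conflict over.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def merge(self, other: "GCounter"):
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    @property
    def value(self) -> int:
        return sum(self.counts.values())

a, b = GCounter("node-a"), GCounter("node-b")
a.increment(3)           # concurrent writes land on different nodes
b.increment(2)
a.merge(b)
b.merge(a)               # merged in either order, both converge
```

The memory-overhead con is also visible here: state grows with the number of nodes that have ever written, and decrements require the more elaborate PN-Counter (a pair of G-Counters).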

4. User-Specified Conflict Handlers

  • Application-defined logic for conflict resolution
  • Business rules determine which update wins
  • Pros: Flexible, domain-specific resolution
  • Cons: Requires careful design, increased application complexity

5. Application-Level Resolution

  • Present conflicts to users for manual resolution
  • Used when automated resolution risks data loss
  • Example: Git merge conflicts, collaborative document editing

Reconciliation Process

To ensure replica convergence:

  1. Anti-entropy: Exchange versions/updates between servers periodically
  2. Reconciliation: Choose appropriate final state when concurrent updates occur
  3. Convergence verification: Ensure all replicas reach same state eventually
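
The three steps above can be sketched for a pair of key-value replicas. This toy round exchanges full state and reconciles per key with last-write-wins (real anti-entropy protocols exchange Merkle-tree digests rather than full state, and the stores shown are hypothetical):

```python
def anti_entropy(replica_a: dict, replica_b: dict) -> dict:
    """One anti-entropy round between two replicas. Each store maps
    key -> (value, timestamp); reconciliation uses last-write-wins per
    key, and both replicas adopt the merged state, so a single exchange
    is enough for these two nodes to converge."""
    merged = {}
    for key in replica_a.keys() | replica_b.keys():
        a = replica_a.get(key, (None, -1))
        b = replica_b.get(key, (None, -1))
        merged[key] = a if a[1] >= b[1] else b   # newer timestamp wins
    replica_a.clear(); replica_a.update(merged)
    replica_b.clear(); replica_b.update(merged)
    return merged

a = {"user:1": ("alice", 10), "user:2": ("bob", 5)}
b = {"user:1": ("alicia", 12), "user:3": ("carol", 7)}
anti_entropy(a, b)
```

After the round both stores hold identical state, which is the convergence-verification condition in step 3.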

Active-Active Replication Patterns

Active-active (multi-primary) replication allows all nodes to handle reads and writes simultaneously. As of 2026, several patterns have emerged:

Common Patterns

1. Bidirectional Replication

  • Two nodes replicate to each other symmetrically
  • Simplest multi-master setup
  • Suitable for two-datacenter HA

2. Multi-Region Active-Active

  • Each region contains full service stack (app servers, databases, caches)
  • Database replication maintains consistency across regions
  • Users routed to nearest region for low latency
  • Example: Capital One's shared-nothing architecture

3. Edge Computing Active-Active

  • Each edge location operates independently
  • Intermittent connectivity to central systems
  • Local writes continue during network partitions
  • Use cases: Retail stores, ships, IoT gateways, military deployments

PostgreSQL Active-Active in 2026

PostgreSQL 16's logical replication improvements make active-active more practical:

  • Bi-directional replication using native logical replication
  • Improved conflict detection and handling
  • Extensions like pgactive (AWS RDS) simplify setup
  • Still requires careful conflict resolution design

When to Use:

  • Business continuity across regions with extreme HA needs
  • Geographic write latency significantly impacts user experience
  • Edge scenarios with intermittent connectivity

When to Avoid:

  • Database is already fast enough
  • Network latency acceptable
  • Application not designed for eventual consistency
  • Team lacks expertise in distributed systems

Monitoring Replication Lag

Replication lag is the delay between a write on the primary and its application on replicas. Monitoring lag is critical for maintaining service quality.

Key Metrics

PostgreSQL:

  • pg_stat_replication view: replay_lag, write_lag, flush_lag
  • Network lag via cloudsql.googleapis.com/database/replication/network_lag (Google Cloud SQL)

MySQL:

  • Seconds_Behind_Source: Estimates, in seconds, how far the replica lags behind the primary
  • Performance Schema tables: replication_connection_status, replication_applier_status_by_worker
  • ReplicaLag metric in Amazon CloudWatch (AWS RDS)

Monitoring Tools (2026)

Cloud-Native:

  • Google Cloud SQL: Built-in replication lag metrics
  • AWS RDS: CloudWatch integration for replication monitoring
  • Azure Database: CPU/memory monitoring correlated with lag

Open Source:

  • OpenTelemetry metrics for cross-database monitoring (PostgreSQL, MySQL, MongoDB)
  • Prometheus + Grafana for time-series lag visualization
  • Percona Monitoring and Management (PMM) for MySQL/PostgreSQL

Database-Specific:

  • pgAdmin for PostgreSQL visual monitoring
  • MySQL Workbench for replication topology visualization
  • Performance Schema (MySQL) for detailed replication insights

Lag Patterns and Diagnosis

Periodic Spikes:

  • Cause: Scheduled backups, batch jobs
  • Action: Schedule intensive operations during low-traffic windows

Post-Migration Lag:

  • Cause: Large schema changes (ALTER TABLE)
  • Action: Use online DDL tools (pt-online-schema-change, gh-ost)

Gradually Increasing Lag:

  • Cause: Hardware constraints (CPU, disk I/O)
  • Action: Scale replica resources or optimize queries

Sudden Jumps:

  • Cause: Network issues, disk failures
  • Action: Check network connectivity, verify disk health
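
A monitoring pipeline can triage these patterns automatically from a window of lag samples. A rough sketch; the 5-second threshold echoes the alerting suggestion later in this document, and both the threshold and the monotonic-growth heuristic are illustrative choices, not recommendations:

```python
def classify_lag(samples, threshold=5.0):
    """Rough triage of a window of replication-lag samples (seconds),
    mirroring the patterns above: 'ok' (within threshold), 'growing'
    (lag rises monotonically -- replica likely resource-constrained),
    or 'spike' (a one-off jump -- batch job, network blip)."""
    if all(s <= threshold for s in samples):
        return "ok"
    if all(b >= a for a, b in zip(samples, samples[1:])):
        return "growing"
    return "spike"
```

Distinguishing "growing" from "spike" matters operationally: the first calls for scaling or query optimization, the second for checking schedules and network health.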

Best Practices Summary

Architecture Selection

  1. Default to master-slave unless specific requirements demand multi-master
  2. Use synchronous replication only when absolutely necessary (performance cost)
  3. Design for eventual consistency in multi-master scenarios
  4. Test failover procedures regularly—don't wait for production incidents

PostgreSQL Recommendations

  • Enable WAL archiving for production replication
  • Monitor replication slots to prevent disk fill-up
  • Use logical replication for partial or cross-version replication
  • Consider pgactive or similar extensions for active-active if needed

MySQL Recommendations

  • Use ROW binlog format for consistency
  • Enable GTIDs for easier topology management
  • Monitor Seconds_Behind_Source continuously
  • Consider MySQL Group Replication for built-in conflict detection

Conflict Resolution

  • Choose simplest strategy that meets requirements (prefer LWW when acceptable)
  • Evaluate CRDTs for collaborative or real-time use cases
  • Design application with eventual consistency in mind
  • Implement comprehensive testing for conflict scenarios

Monitoring

  • Track replication lag as primary health metric
  • Set alerts for lag thresholds (e.g., >5 seconds)
  • Correlate lag with system metrics (CPU, disk, network)
  • Use OpenTelemetry for unified observability across databases

Operational

  • Automate failover for master-slave setups
  • Document runbooks for common replication issues
  • Practice disaster recovery scenarios quarterly
  • Keep replication topology documented and up-to-date

Conclusion

Database replication remains a cornerstone of high-availability architectures in 2026. While technology has advanced—PostgreSQL 16's logical replication, CRDTs for conflict resolution, OpenTelemetry for monitoring—the fundamental trade-offs persist: consistency vs. availability, simplicity vs. fault tolerance.

For most applications, master-slave replication provides the optimal balance of reliability, performance, and operational simplicity. Multi-master should be reserved for scenarios with clear geographic or availability requirements that justify the additional complexity.

The key to successful replication lies not in choosing the most advanced pattern, but in understanding your system's requirements, designing for the chosen consistency model, and maintaining robust monitoring and failover procedures.
