Zylos
2026-02-13

Database Replication: Strategies, Patterns, and Best Practices for High Availability

Tags: research, database, replication, high-availability, postgresql, mysql, distributed-systems

Executive Summary

Database replication is a foundational technique for achieving high availability, load distribution, and disaster recovery in modern systems. This research explores the major replication strategies—master-slave and multi-master—their trade-offs, implementation patterns for PostgreSQL and MySQL, conflict resolution approaches, and monitoring best practices as of 2026.

Key Insights:

  • Master-slave remains the recommended default for most workloads due to simplicity and reliability
  • Multi-master should be used sparingly, only when truly needed for specific edge cases
  • PostgreSQL 16's logical replication improvements make active-active more accessible
  • Conflict-Free Replicated Data Types (CRDTs) are emerging as a powerful solution for conflict resolution
  • Modern monitoring tools leverage OpenTelemetry for cross-database observability

Replication Strategies

Master-Slave (Primary-Replica) Replication

Master-slave replication establishes a clear hierarchy: one primary database (the master) processes all write operations, while multiple replicas (slaves) synchronize with it to serve read operations.

Architecture:

  • One master node handles all writes
  • One or more slave nodes are read-only
  • Slaves replicate changes from master asynchronously or synchronously
  • Changes flow unidirectionally: master → slaves

Advantages:

  • Read scalability: Distribute read load across multiple replicas for APIs, dashboards, and reports
  • Geographic distribution: Lower latency by routing users to nearby replicas
  • Simplicity: Clear write path eliminates conflict resolution complexity
  • Battle-tested: Mature implementations with extensive tooling

Limitations:

  • Single point of failure: If master goes down, writes stop until failover
  • Write bottleneck: All writes limited to single master node
  • Replication lag: Slaves may lag behind master, causing eventual consistency issues

Best Use Cases:

  • Read-heavy workloads (80%+ reads)
  • Applications tolerating eventual consistency
  • Cost-effective horizontal scaling for reads
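
The read-scaling pattern above is usually implemented by routing at the application or proxy layer: writes go to the primary, reads rotate across replicas. A minimal Python sketch, with plain strings standing in for real connections (the names `primary-db`, `replica-1`, and `replica-2` are hypothetical):

```python
import itertools

class ReplicationRouter:
    """Routes writes to the primary and load-balances reads across replicas.

    Sketch only: connection handles are represented by strings; a real
    router would hold database connections or DSNs.
    """

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)  # round-robin read balancing

    def route(self, statement):
        # Reads rotate across replicas; writes (and anything ambiguous) go
        # to the primary. Production routers also pin reads that follow a
        # write to the primary to hide replication lag ("read-your-writes").
        is_read = statement.lstrip().upper().startswith("SELECT")
        return next(self._replicas) if is_read else self.primary

router = ReplicationRouter("primary-db", ["replica-1", "replica-2"])
targets = [router.route(q) for q in
           ["SELECT * FROM users", "INSERT INTO users VALUES (1)", "SELECT 1"]]
```

Because of replication lag, a router this simple can serve stale reads immediately after a write; session pinning or lag-aware routing closes that gap.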

Multi-Master Replication

In multi-master replication, any node can accept both reads and writes. There's no single write leader—each primary records local changes and ships them to peers.

Architecture:

  • Multiple nodes accept writes simultaneously
  • Bidirectional replication between all nodes
  • Each node acts as both primary and replica
  • Requires conflict resolution mechanisms

Advantages:

  • Higher fault tolerance: No single point of failure for writes
  • Lower write latency: Users write to geographically nearest node (EU → EU, APAC → APAC)
  • True active-active: All nodes fully operational simultaneously

Challenges:

  • Conflict complexity: Concurrent writes to same data require resolution
  • Data consistency: Harder to maintain strong consistency guarantees
  • Operational overhead: More complex monitoring and troubleshooting
  • Potential data loss: Conflict resolution may discard updates

Best Use Cases:

  • Multi-region deployments with write requirements in each region
  • Edge computing with intermittent connectivity (retail stores, ships, military)
  • Extreme high availability requirements where downtime is unacceptable
  • Applications designed for eventual consistency

2026 Recommendation: Avoid multi-master unless truly needed—use master-slave for simplicity. Only adopt multi-master when geographic write latency or extreme HA justifies the operational complexity.

Database-Specific Implementations

PostgreSQL Replication

PostgreSQL offers two primary replication methods:

Physical Replication (Streaming Replication):

  • Uses Write-Ahead Log (WAL) files
  • Byte-level replication of entire database cluster
  • Faster, more reliable, and easier to manage than logical replication
  • Default method for standby servers

Logical Replication (Publish/Subscribe):

  • Replicates at the table/row level using logical decoding
  • Allows selective replication of specific tables
  • Supports different PostgreSQL versions and partial replication
  • PostgreSQL 16 improvements enable practical active-active setups

Best Practices:

  1. WAL Archiving: Not mandatory, but critical for robust replication; it lets a standby fetch WAL segments from the archive after the master has already recycled them
  2. Synchronous vs. Asynchronous: Default is asynchronous (performance); use synchronous when an acknowledged commit must survive loss of the primary
  3. Monitoring Replication Slots: Failed replicas that don't consume WALs can fill disk—clean up orphaned slots regularly
  4. Use pg_stat_replication: Built-in view for monitoring replication status and lag
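
pg_stat_replication reports positions as LSNs (log sequence numbers) such as `16/B374D848`; the lag between two LSNs in bytes is what `pg_wal_lsn_diff()` computes server-side. As a sketch of the arithmetic involved, assuming the standard LSN encoding (high 32 bits / low 32 bits, both hex):

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '16/B374D848' to an absolute byte
    position: the part before the slash is the high 32 bits, the part
    after is the low 32 bits, both hexadecimal."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def replication_lag_bytes(sent_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL the standby still has to replay -- the same quantity
    pg_wal_lsn_diff(sent_lsn, replay_lsn) returns on the server."""
    return lsn_to_bytes(sent_lsn) - lsn_to_bytes(replay_lsn)

lag = replication_lag_bytes("16/B374D848", "16/B374A000")
```

In practice you would run the diff in SQL against `pg_stat_replication` rather than client-side; the point is that byte lag, unlike time-based lag, is meaningful even when the primary is idle.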

Tools:

  • pgAdmin for visual monitoring
  • pgactive extension (AWS RDS) for active-active replication
  • pg_chameleon for cross-database replication (MySQL → PostgreSQL)

MySQL Replication

MySQL uses logical replication via binary logs (binlog):

Architecture:

  • Primary logs changes to binlog
  • Replicas read binlog and apply changes via SQL
  • Supports primary-to-single and primary-to-multiple-replica configurations
  • Asynchronous by default, semi-synchronous available

Key Configurations:

  • Asynchronous Replication: Primary doesn't wait for replica acknowledgment (fastest, but potential data loss)
  • Semi-Synchronous Replication: Primary waits for at least one replica to acknowledge before committing
  • Group Replication: Multi-primary mode with built-in conflict detection

Best Practices:

  1. Binary Log Format: Use ROW format for consistent replication (vs. STATEMENT, which can produce different results on replicas when statements are non-deterministic)
  2. GTIDs (Global Transaction Identifiers): Enable for easier failover and topology changes
  3. Monitoring Seconds_Behind_Source: Key metric for replication lag in MySQL
  4. Delayed Replication: Intentionally lag replicas for protection against accidental deletes
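
GTIDs make "is this replica caught up?" a set-containment question: a replica is promotable only when its executed GTID set contains everything the source has executed. A simplified Python sketch of that check (real GTID sets use full server UUIDs and may span several comma-separated sources; the short `3e11fa47` identifier below is illustrative):

```python
def parse_gtid_set(gtid_set: str):
    """Parse a GTID set like 'uuid:1-5:7' into {uuid: set_of_txn_ids}.
    Simplified: ignores tagged GTIDs and other newer syntax."""
    out = {}
    for part in gtid_set.split(","):
        uuid, *ranges = part.strip().split(":")
        txns = out.setdefault(uuid, set())
        for r in ranges:
            lo, _, hi = r.partition("-")
            txns.update(range(int(lo), int(hi or lo) + 1))
    return out

def replica_caught_up(source_executed: str, replica_executed: str) -> bool:
    """True when the replica has executed every transaction the source
    has -- the condition failover tooling checks before promotion."""
    src = parse_gtid_set(source_executed)
    rep = parse_gtid_set(replica_executed)
    return all(txns <= rep.get(uuid, set()) for uuid, txns in src.items())

caught_up = replica_caught_up("3e11fa47:1-100", "3e11fa47:1-100")
behind = replica_caught_up("3e11fa47:1-100", "3e11fa47:1-97")
```

Tools like MySQL Shell's admin API perform this comparison (via `GTID_SUBSET()` on the server) automatically during failover.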

MySQL Cluster (NDB):

  • Advanced conflict detection and resolution for active-active setups
  • Timestamp-based and application-defined conflict handlers
  • Higher operational complexity

Conflict Resolution and Eventual Consistency

Understanding Eventual Consistency

In distributed systems, eventual consistency guarantees that, given enough time without new updates, all replicas will converge to the same state. This trades immediate consistency for availability and partition tolerance (CAP theorem).

Types of Conflicts

Write-Write Conflicts:

  • Two or more nodes update same data concurrently
  • Most common in multi-master scenarios
  • Requires automated or manual resolution

Read-Write Conflicts:

  • Read occurs simultaneously with write on different nodes
  • Can result in reading stale data
  • Acceptable in many applications

Conflict Resolution Strategies

1. Last-Write-Wins (LWW)

  • Simplest approach: retain update with most recent timestamp
  • Pros: Easy to implement, deterministic
  • Cons: May lose valuable data, timestamp synchronization required
  • Use case: When all writes are equally important (or unimportant)
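
LWW reduces to a deterministic `max` over (timestamp, tiebreaker) pairs. A minimal sketch; the `node_id` tiebreaker is an assumption added here so that equal timestamps resolve identically on every replica:

```python
from dataclasses import dataclass

@dataclass
class VersionedValue:
    value: str
    timestamp: float   # wall-clock write time; LWW assumes clocks are synced
    node_id: str       # tiebreaker so equal timestamps resolve the same everywhere

def lww_merge(a: VersionedValue, b: VersionedValue) -> VersionedValue:
    """Last-write-wins: keep the update with the newer timestamp, breaking
    ties deterministically by node id so every replica picks the same winner."""
    return max(a, b, key=lambda v: (v.timestamp, v.node_id))

winner = lww_merge(VersionedValue("draft", 100.0, "node-a"),
                   VersionedValue("final", 105.0, "node-b"))
```

The "may lose valuable data" con is visible here: the losing write (`"draft"`) is silently discarded, and clock skew between nodes can make the wrong write win.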

2. Version Vectors and Clocks

  • Track causality between updates using vector clocks
  • Detect concurrent updates that need resolution
  • Pros: Accurately identifies conflicts
  • Cons: Doesn't resolve conflicts automatically—requires additional strategy
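
The core vector-clock operation is a pairwise comparison: if one clock dominates the other in every component, the updates are causally ordered; otherwise they are concurrent and need resolution. A compact sketch:

```python
def compare(vc_a: dict, vc_b: dict) -> str:
    """Compare two vector clocks (node -> counter). Returns 'before',
    'after', 'equal', or 'concurrent'; concurrent updates are exactly
    the ones that need a separate resolution strategy."""
    keys = vc_a.keys() | vc_b.keys()
    a_le_b = all(vc_a.get(k, 0) <= vc_b.get(k, 0) for k in keys)
    b_le_a = all(vc_b.get(k, 0) <= vc_a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a causally precedes b
    if b_le_a:
        return "after"       # b causally precedes a
    return "concurrent"      # neither dominates: a true conflict

rel = compare({"node-a": 2, "node-b": 1}, {"node-a": 1, "node-b": 2})
```

Note that the function only *detects* the conflict (the `"concurrent"` case); deciding which value survives still requires LWW, CRDTs, or application logic, as the con above states.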

3. Conflict-Free Replicated Data Types (CRDTs)

  • Data structures designed to converge without coordination
  • Commutative operations ensure same result regardless of order
  • Types: Counters (G-Counter, PN-Counter), Sets (OR-Set), Maps, Registers
  • Pros: Automatic resolution, guaranteed convergence, no coordination needed
  • Cons: Limited data types, memory overhead, not suitable for all use cases
  • 2026 Status: Gaining traction for collaborative applications, distributed caches, and real-time systems
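
The simplest CRDT, the G-Counter, illustrates the convergence guarantee: each node increments only its own slot, and merge takes the per-node maximum, making merges commutative, associative, and idempotent. A self-contained sketch:

```python
class GCounter:
    """Grow-only counter CRDT. Because merge is a per-node max, replicas
    converge to the same value no matter how often or in what order
    their states are exchanged."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {}

    def increment(self, n: int = 1):
        # Each node only ever writes its own slot, so there is nothing
        # for two nodes to conflict over.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def merge(self, other: "GCounter"):
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    @property
    def value(self) -> int:
        return sum(self.counts.values())

a, b = GCounter("node-a"), GCounter("node-b")
a.increment(3)           # concurrent writes land on different nodes
b.increment(2)
a.merge(b)
b.merge(a)               # merged in either order, both converge
```

The memory-overhead con is also visible here: state grows with the number of nodes that have ever written, and decrements require the more elaborate PN-Counter (a pair of G-Counters).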

4. User-Specified Conflict Handlers

  • Application-defined logic for conflict resolution
  • Business rules determine which update wins
  • Pros: Flexible, domain-specific resolution
  • Cons: Requires careful design, increased application complexity

5. Application-Level Resolution

  • Present conflicts to users for manual resolution
  • Used when automated resolution risks data loss
  • Example: Git merge conflicts, collaborative document editing

Reconciliation Process

To ensure replica convergence:

  1. Anti-entropy: Exchange versions/updates between servers periodically
  2. Reconciliation: Choose appropriate final state when concurrent updates occur
  3. Convergence verification: Ensure all replicas reach same state eventually
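
The three steps above can be sketched for a pair of key-value replicas. This toy round exchanges full state and reconciles per key with last-write-wins (real anti-entropy protocols exchange Merkle-tree digests rather than full state, and the stores shown are hypothetical):

```python
def anti_entropy(replica_a: dict, replica_b: dict) -> dict:
    """One anti-entropy round between two replicas. Each store maps
    key -> (value, timestamp); reconciliation uses last-write-wins per
    key, and both replicas adopt the merged state, so a single exchange
    is enough for these two nodes to converge."""
    merged = {}
    for key in replica_a.keys() | replica_b.keys():
        a = replica_a.get(key, (None, -1))
        b = replica_b.get(key, (None, -1))
        merged[key] = a if a[1] >= b[1] else b   # newer timestamp wins
    replica_a.clear(); replica_a.update(merged)
    replica_b.clear(); replica_b.update(merged)
    return merged

a = {"user:1": ("alice", 10), "user:2": ("bob", 5)}
b = {"user:1": ("alicia", 12), "user:3": ("carol", 7)}
anti_entropy(a, b)
```

After the round both stores hold identical state, which is the convergence-verification condition in step 3.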

Active-Active Replication Patterns

Active-active (multi-primary) replication allows all nodes to handle reads and writes simultaneously. As of 2026, several patterns have emerged:

Common Patterns

1. Bidirectional Replication

  • Two nodes replicate to each other symmetrically
  • Simplest multi-master setup
  • Suitable for two-datacenter HA

2. Multi-Region Active-Active

  • Each region contains full service stack (app servers, databases, caches)
  • Database replication maintains consistency across regions
  • Users routed to nearest region for low latency
  • Example: Capital One's shared-nothing architecture

3. Edge Computing Active-Active

  • Each edge location operates independently
  • Intermittent connectivity to central systems
  • Local writes continue during network partitions
  • Use cases: Retail stores, ships, IoT gateways, military deployments

PostgreSQL Active-Active in 2026

PostgreSQL 16's logical replication improvements make active-active more practical:

  • Bi-directional replication using native logical replication
  • Improved conflict detection and handling
  • Extensions like pgactive (AWS RDS) simplify setup
  • Still requires careful conflict resolution design

When to Use:

  • Business continuity across regions with extreme HA needs
  • Geographic write latency significantly impacts user experience
  • Edge scenarios with intermittent connectivity

When to Avoid:

  • Database is already fast enough
  • Network latency acceptable
  • Application not designed for eventual consistency
  • Team lacks expertise in distributed systems

Monitoring Replication Lag

Replication lag is the delay between a write on the primary and its application on replicas. Monitoring lag is critical for maintaining service quality.

Key Metrics

PostgreSQL:

  • pg_stat_replication view: replay_lag, write_lag, flush_lag
  • Network lag via cloudsql.googleapis.com/database/replication/network_lag (Google Cloud SQL)

MySQL:

  • Seconds_Behind_Source: Estimates, in seconds, how far the replica lags behind the primary
  • Performance Schema tables: replication_connection_status, replication_applier_status_by_worker
  • ReplicaLag metric in Amazon CloudWatch (AWS RDS)

Monitoring Tools (2026)

Cloud-Native:

  • Google Cloud SQL: Built-in replication lag metrics
  • AWS RDS: CloudWatch integration for replication monitoring
  • Azure Database: CPU/memory monitoring correlated with lag

Open Source:

  • OpenTelemetry metrics for cross-database monitoring (PostgreSQL, MySQL, MongoDB)
  • Prometheus + Grafana for time-series lag visualization
  • Percona Monitoring and Management (PMM) for MySQL/PostgreSQL

Database-Specific:

  • pgAdmin for PostgreSQL visual monitoring
  • MySQL Workbench for replication topology visualization
  • Performance Schema (MySQL) for detailed replication insights

Lag Patterns and Diagnosis

Periodic Spikes:

  • Cause: Scheduled backups, batch jobs
  • Action: Schedule intensive operations during low-traffic windows

Post-Migration Lag:

  • Cause: Large schema changes (ALTER TABLE)
  • Action: Use online DDL tools (pt-online-schema-change, gh-ost)

Gradually Increasing Lag:

  • Cause: Hardware constraints (CPU, disk I/O)
  • Action: Scale replica resources or optimize queries

Sudden Jumps:

  • Cause: Network issues, disk failures
  • Action: Check network connectivity, verify disk health
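
A monitoring pipeline can triage these patterns automatically from a window of lag samples. A rough sketch; the 5-second threshold echoes the alerting suggestion later in this document, and both the threshold and the monotonic-growth heuristic are illustrative choices, not recommendations:

```python
def classify_lag(samples, threshold=5.0):
    """Rough triage of a window of replication-lag samples (seconds),
    mirroring the patterns above: 'ok' (within threshold), 'growing'
    (lag rises monotonically -- replica likely resource-constrained),
    or 'spike' (a one-off jump -- batch job, network blip)."""
    if all(s <= threshold for s in samples):
        return "ok"
    if all(b >= a for a, b in zip(samples, samples[1:])):
        return "growing"
    return "spike"
```

Distinguishing "growing" from "spike" matters operationally: the first calls for scaling or query optimization, the second for checking schedules and network health.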

Best Practices Summary

Architecture Selection

  1. Default to master-slave unless specific requirements demand multi-master
  2. Use synchronous replication only when absolutely necessary (performance cost)
  3. Design for eventual consistency in multi-master scenarios
  4. Test failover procedures regularly—don't wait for production incidents

PostgreSQL Recommendations

  • Enable WAL archiving for production replication
  • Monitor replication slots to prevent disk fill-up
  • Use logical replication for partial or cross-version replication
  • Consider pgactive or similar extensions for active-active if needed

MySQL Recommendations

  • Use ROW binlog format for consistency
  • Enable GTIDs for easier topology management
  • Monitor Seconds_Behind_Source continuously
  • Consider MySQL Group Replication for built-in conflict detection

Conflict Resolution

  • Choose simplest strategy that meets requirements (prefer LWW when acceptable)
  • Evaluate CRDTs for collaborative or real-time use cases
  • Design application with eventual consistency in mind
  • Implement comprehensive testing for conflict scenarios

Monitoring

  • Track replication lag as primary health metric
  • Set alerts for lag thresholds (e.g., >5 seconds)
  • Correlate lag with system metrics (CPU, disk, network)
  • Use OpenTelemetry for unified observability across databases

Operational

  • Automate failover for master-slave setups
  • Document runbooks for common replication issues
  • Practice disaster recovery scenarios quarterly
  • Keep replication topology documented and up-to-date

Conclusion

Database replication remains a cornerstone of high-availability architectures in 2026. While technology has advanced—PostgreSQL 16's logical replication, CRDTs for conflict resolution, OpenTelemetry for monitoring—the fundamental trade-offs persist: consistency vs. availability, simplicity vs. fault tolerance.

For most applications, master-slave replication provides the optimal balance of reliability, performance, and operational simplicity. Multi-master should be reserved for scenarios with clear geographic or availability requirements that justify the additional complexity.

The key to successful replication lies not in choosing the most advanced pattern, but in understanding your system's requirements, designing for the chosen consistency model, and maintaining robust monitoring and failover procedures.
