Database migration playbook: zero-downtime strategies
Four patterns for zero-downtime migration
Dual-Write
Write to both the old and new database simultaneously during migration. The application layer handles consistency while you validate the new store in production conditions.
Higher write latency during migration, but zero read disruption.
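A minimal sketch of the dual-write path, with plain dicts standing in for the two database clients (all names here are illustrative): the old store stays authoritative, and a failed write to the new store is logged for later reconciliation instead of failing the request.

```python
import logging

# Hypothetical stores: plain dicts stand in for real database clients.
old_db: dict = {}
new_db: dict = {}

log = logging.getLogger("migration")

def dual_write(key, value):
    """Write to the old (authoritative) store first, then the new one.

    A failure on the new store is logged for reconciliation but never
    fails the request -- the old database remains the source of truth.
    """
    old_db[key] = value          # must succeed; exceptions propagate
    try:
        new_db[key] = value      # best effort during migration
    except Exception:
        log.warning("dual-write to new store failed for key=%r", key)

dual_write("user:42", {"name": "Ada"})
```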
Shadow Reads
Read from both databases and compare results without serving the new data to users. Catches schema mismatches, encoding differences, and query performance regressions before cutover.
Doubles read load temporarily, but provides confidence before switching.
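The comparison logic can be sketched like this, again with dict-backed stand-ins for the two stores: the old database always serves the response, while any divergence is recorded for investigation rather than shown to users.

```python
# Shadow-read sketch: serve from the old store, compare against the new
# store, and record mismatches without affecting what users see.
old_db = {"user:1": "alice", "user:2": "bob"}
new_db = {"user:1": "alice", "user:2": "b0b"}   # deliberate mismatch

mismatches = []

def shadow_read(key):
    primary = old_db.get(key)        # what the user actually receives
    shadow = new_db.get(key)         # compared, never served
    if shadow != primary:
        mismatches.append((key, primary, shadow))
    return primary

results = [shadow_read(k) for k in ("user:1", "user:2")]
```

In production the shadow read would happen off the request path (async task or sampled percentage) so it does not add latency to user traffic.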
Expand-Contract Schema
Add new columns or tables alongside existing ones (expand), migrate data, update application code, then remove old structures (contract). No breaking changes at any step.
Requires disciplined multi-phase deployments and backward-compatible code.
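The expand-migrate-contract sequence can be sketched against an in-memory SQLite database (the table and column names are invented for illustration). SQLite's portable contract step is rebuild-and-rename; most engines would use ALTER TABLE ... DROP COLUMN instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")

# Expand: add new columns alongside the old one; nothing breaks.
conn.execute("ALTER TABLE users ADD COLUMN first_name TEXT")
conn.execute("ALTER TABLE users ADD COLUMN last_name TEXT")

# Migrate: backfill the new columns from the old one.
conn.execute("""
    UPDATE users SET
        first_name = substr(fullname, 1, instr(fullname, ' ') - 1),
        last_name  = substr(fullname, instr(fullname, ' ') + 1)
""")

# (Application code is now deployed to read and write the new columns.)

# Contract: once nothing reads `fullname`, drop the old structure.
conn.execute("""
    CREATE TABLE users_new (
        id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT
    )
""")
conn.execute(
    "INSERT INTO users_new SELECT id, first_name, last_name FROM users")
conn.execute("DROP TABLE users")
conn.execute("ALTER TABLE users_new RENAME TO users")

row = conn.execute("SELECT first_name, last_name FROM users").fetchone()
```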
Online Migration Tooling
Tools like gh-ost and pt-online-schema-change alter live MySQL tables by creating shadow tables and replaying changes without locking; pglogical achieves the same effect for PostgreSQL by streaming row changes to a new cluster.
Needs sufficient disk space for shadow tables and careful monitoring.
Six phases of a zero-downtime migration
Schema Analysis
Map every table, index, constraint, and stored procedure. Identify type mismatches between source and target. Document implicit dependencies that ORMs hide.
Compatibility Layer
Build an abstraction that lets application code work with both old and new schemas simultaneously. Feature flags control which database serves reads.
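One way to sketch such a layer, with invented names throughout: a repository that always dual-writes but routes reads through a feature flag, so cutover is a flag flip rather than a deploy.

```python
# Compatibility-layer sketch. FLAGS stands in for a real feature-flag
# service; the two dicts stand in for database clients.
FLAGS = {"read_from_new_db": False}

class UserRepository:
    def __init__(self, old_db, new_db):
        self.old_db = old_db
        self.new_db = new_db

    def get(self, user_id):
        # The flag decides which store serves reads; flip it per
        # environment or per-request cohort without redeploying.
        store = self.new_db if FLAGS["read_from_new_db"] else self.old_db
        return store.get(user_id)

    def put(self, user_id, record):
        self.old_db[user_id] = record   # authoritative
        self.new_db[user_id] = record   # shadow copy

repo = UserRepository(old_db={}, new_db={})
repo.put(7, "ada")
before = repo.get(7)
FLAGS["read_from_new_db"] = True
after = repo.get(7)
```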
Dual-Write Setup
Instrument the write path to persist data to both databases. Use async replication for bulk history, synchronous writes for new data. Monitor write latency closely.
Validation & Reconciliation
Run continuous comparison jobs between source and target. Check row counts, checksums, and sampled field values. Fix discrepancies before proceeding.
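A reconciliation job can be sketched as a row-count plus checksum comparison per table. Here tables are modeled as dicts of primary key to row; a real job would compute the digests inside the database with a query.

```python
import hashlib

def table_checksum(rows):
    """Order-independent digest over primary key + row contents."""
    digest = hashlib.sha256()
    for pk in sorted(rows):
        digest.update(repr((pk, rows[pk])).encode())
    return digest.hexdigest()

def reconcile(source, target):
    report = {}
    for table in source:
        src, tgt = source[table], target.get(table, {})
        report[table] = {
            "count_match": len(src) == len(tgt),
            "checksum_match": table_checksum(src) == table_checksum(tgt),
        }
    return report

source = {"users": {1: "ada", 2: "bob"}}
target = {"users": {1: "ada", 2: "b0b"}}   # same count, drifted contents
report = reconcile(source, target)
```

Note that the row counts match while the checksums do not: counting alone would have missed this drift.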
Traffic Cutover
Shift read traffic gradually — 1%, 10%, 50%, 100%. Monitor error rates, latency percentiles, and data consistency at each step. Keep the old database warm.
Cleanup
Remove dual-write code, decommission the old database, and archive the compatibility layer. Update documentation and runbooks. Run a retrospective.
Tools we recommend for database migration
pglogical
Logical replication for PostgreSQL — migrate between versions or to new clusters without downtime using publication/subscription.
Best for: PostgreSQL version upgrades and cross-cluster migration
gh-ost
GitHub's online schema migration for MySQL. Creates shadow tables, copies data, replays binlog changes, and performs an atomic cutover.
Best for: MySQL schema changes on large tables in production
pt-online-schema-change
Percona's tool for ALTER TABLE operations without locking. Uses triggers to capture changes during migration.
Best for: MySQL environments where trigger-based migration is acceptable
AWS DMS
Database Migration Service handles heterogeneous migrations (Oracle to PostgreSQL, SQL Server to Aurora, and more), using change data capture to keep source and target in sync.
Best for: Cross-engine migrations and cloud database adoption
Flyway
Version-controlled schema migrations with a migration-per-file approach. Tracks applied migrations and ensures repeatable deployments.
Best for: Schema version control and CI/CD integration
Liquibase
Database-agnostic schema change management with rollback support, diff generation, and preconditions for safe deployments.
Best for: Multi-database environments needing vendor-neutral migration tooling
Zero downtime isn't a luxury — it's an engineering choice.
Every maintenance window is a decision to accept risk. Every “we'll do it Saturday night” is a bet that nothing will go wrong under pressure and fatigue. Zero-downtime migration takes more upfront engineering, but it eliminates the late-night gambles and gives you the ability to roll back at any point. That's not perfectionism — that's professionalism.
Connection pooling during migration
External Connection Poolers
PgBouncer (for PostgreSQL) and ProxySQL (for MySQL) sit between your application and the database, multiplexing hundreds of application connections into a smaller pool of actual database connections. During migration, deploy a pooler in front of both the source and target databases. Configure the pooler in transaction mode, not session mode, so connections are returned to the pool after each transaction rather than being held for the lifetime of the application connection. This is critical during dual-write, where each application request touches two databases and you cannot afford to hold two connections idle between requests.
For cloud-managed databases, use the provider's built-in pooling: Amazon RDS Proxy, Azure Database Flexible Server connection pooling, or Google Cloud SQL Auth Proxy. These handle TLS termination and IAM authentication alongside connection multiplexing, reducing operational surface area during an already complex migration.
Application-Level Pool Configuration
Your ORM or database driver has its own connection pool settings. During migration, reduce the maximum pool size for each database to half your normal limit. If your application normally uses a pool of 20 connections and you are dual-writing, configure 10 connections to the source and 10 to the target. Monitor pool wait times — if application threads are blocking on pool acquisition, you need to either increase pool limits (if the database can handle it) or optimize your queries to hold connections for shorter durations.
Set aggressive connection timeouts during migration. A connection checkout timeout of 5 seconds (rather than the default 30 seconds in many ORMs) ensures you fail fast and surface pool exhaustion as an error rather than a slow degradation that is hard to diagnose. Pair this with circuit breakers on the write path so a slow target database does not back up the connection pool and take down reads from the source.
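The fail-fast behavior can be illustrated with a toy bounded pool built on stdlib primitives; real deployments use PgBouncer, ProxySQL, or the driver's own pool, and the sizes and timeout below are invented for the example.

```python
import queue

class Pool:
    """Bounded pool with an aggressive checkout timeout."""

    def __init__(self, size: int, checkout_timeout: float):
        self._conns = queue.Queue(maxsize=size)
        for i in range(size):
            self._conns.put(f"conn-{i}")   # stand-in for a real connection
        self.checkout_timeout = checkout_timeout

    def acquire(self):
        try:
            # Surface exhaustion as an error instead of blocking forever.
            return self._conns.get(timeout=self.checkout_timeout)
        except queue.Empty:
            raise TimeoutError("connection pool exhausted")

    def release(self, conn):
        self._conns.put(conn)

# During dual-write, split the normal budget between the two databases.
source_pool = Pool(size=2, checkout_timeout=0.1)
c1, c2 = source_pool.acquire(), source_pool.acquire()
try:
    source_pool.acquire()          # pool empty: fails fast, loudly
    exhausted = False
except TimeoutError:
    exhausted = True
source_pool.release(c1)
```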
Monitoring and alerting during cutover
Replication Lag Monitoring
If you are using change data capture or logical replication to keep the target in sync, replication lag is your most important metric. Set up a heartbeat table that writes a timestamp to the source every second and measures when it appears on the target. Alert if lag exceeds your tolerance (typically 1-5 seconds for online migration). Do not proceed with cutover until lag is consistently below your threshold for at least 30 minutes.
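The lag computation itself is simple: read the newest heartbeat timestamp visible on the target and subtract it from the current time. A sketch, with an assumed 5-second tolerance:

```python
import time

def replication_lag(target_heartbeat_ts: float, now: float) -> float:
    """Lag = now minus the newest source heartbeat visible on the target."""
    return now - target_heartbeat_ts

now = time.time()
# Pretend the freshest heartbeat the target has seen is 2.3s old.
lag = replication_lag(target_heartbeat_ts=now - 2.3, now=now)
SAFE_TO_CUT_OVER = lag <= 5.0   # alert and pause above the tolerance
```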
Query Latency Percentiles
Track P50, P95, and P99 latency for both read and write queries against the target database before shifting production traffic. Compare these percentiles against the source database baseline. A P50 that matches but a P99 that is 10x higher indicates missing indexes or suboptimal query plans on the target. During traffic shift, monitor these percentiles per percentage step — if P99 degrades at 10% traffic, do not proceed to 50%.
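The "matching P50, exploded P99" pattern is easy to demonstrate with the stdlib; the millisecond samples below are invented to show a tail regression that a median comparison would miss.

```python
import statistics

source_ms = [5, 5, 6, 6, 7, 7, 8, 9, 10, 12]
target_ms = [5, 5, 6, 6, 7, 7, 8, 9, 10, 120]   # same median, bad tail

def pct(samples, p):
    # quantiles(n=100) returns 99 cut points; cuts[49] is P50, cuts[98] P99.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[p - 1]

p50_ok = pct(target_ms, 50) == pct(source_ms, 50)
tail_regressed = pct(target_ms, 99) > 5 * pct(source_ms, 99)
```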
Error Rate by Query Type
Aggregate error rates hide critical issues. A 0.1% overall error rate during cutover sounds acceptable until you discover that 100% of a specific query type is failing because of a schema difference. Break error monitoring down by query category: reads vs writes, by table, and by application endpoint. Set per-category alert thresholds that are tighter than your normal production alerting. During migration, a 0.01% error rate on writes should trigger investigation.
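A sketch of the per-category breakdown, with made-up traffic numbers chosen to show how a tiny overall rate can hide a fully broken query type:

```python
from collections import defaultdict

totals = defaultdict(int)
errors = defaultdict(int)

def record(category, ok):
    totals[category] += 1
    if not ok:
        errors[category] += 1

# 10,000 reads succeed; every one of 10 writes to one table fails.
for _ in range(10_000):
    record(("users", "read"), ok=True)
for _ in range(10):
    record(("orders", "write"), ok=False)

overall = sum(errors.values()) / sum(totals.values())     # ~0.1%: looks fine
per_cat = {c: errors[c] / totals[c] for c in totals}
worst = max(per_cat.values())                             # 100%: it is not
```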
Data Consistency Dashboard
Build a dedicated dashboard that runs continuous row-count comparisons and checksum validations between source and target. Display: total row counts per table, delta since last check, checksum mismatches by table, and the timestamp of the last successful full reconciliation. This dashboard should be visible to the entire migration team during cutover — ideally on a shared screen or war room display. Any divergence between source and target is a reason to pause the cutover.
Testing strategies for migration validation
Schema Compatibility Tests
Write automated tests that compare the source and target schemas column by column. Check data types, nullable constraints, default values, indexes, and foreign keys. A VARCHAR(255) on MySQL becoming a TEXT on PostgreSQL changes storage behavior and index capabilities. A DATETIME column losing timezone information during migration will corrupt timestamps silently. These tests should run in CI against both databases and fail the build if schemas diverge in unexpected ways.
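A sketch of such a test using SQLite's introspection for brevity (a real test would query information_schema on both engines); the schemas below are invented, with a deliberate type and nullability drift on one column:

```python
import sqlite3

def schema(conn, table):
    # (name, declared type, notnull, default) per column.
    return [(r[1], r[2], r[3], r[4])
            for r in conn.execute(f"PRAGMA table_info({table})")]

src = sqlite3.connect(":memory:")
tgt = sqlite3.connect(":memory:")
src.execute("CREATE TABLE users (id INTEGER, email VARCHAR(255) NOT NULL)")
tgt.execute("CREATE TABLE users (id INTEGER, email TEXT)")   # drifted

diff = [(a, b) for a, b in zip(schema(src, "users"), schema(tgt, "users"))
        if a != b]   # non-empty diff should fail the CI build
```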
Shadow Traffic Replay
Capture a sample of production queries (reads and writes) from your application logs or query audit log. Replay them against the target database and compare results. For reads, compare result sets row by row. For writes, compare the resulting database state. This catches issues that synthetic tests miss: queries that rely on implicit type casting, collation differences, or database-specific SQL extensions. Tools like pt-query-digest (Percona) can capture and replay MySQL queries. For PostgreSQL, use pgbadger to analyze query logs and generate replay scripts.
Load Testing at Production Scale
Restore a production-sized dataset to the target database and run load tests that simulate your peak traffic pattern. Do not skip this step because the target database is “the same engine, just a newer version.” Query planner behavior changes between versions. Index selection heuristics change. Connection handling changes. A migration that performs perfectly with 1 million rows and 10 concurrent connections may fail catastrophically with 500 million rows and 200 connections because the query planner chooses a sequential scan instead of an index lookup at scale.
Rollback Drill
Practice the rollback procedure before you need it. Simulate a failed cutover: shift 10% of traffic to the target, inject a failure condition, and execute the rollback plan. Measure how long it takes and whether any data is lost during the rollback. If your rollback involves switching a feature flag, verify the flag propagation time. If it involves DNS changes, measure TTL-related lag. A rollback plan that has never been tested is not a plan — it is a hope.
Real-world timeline for a typical database migration
Discovery and Schema Analysis
Map the complete schema including stored procedures, triggers, views, and implicit dependencies. Identify every application that connects to the database — not just your primary application, but reporting tools, cron jobs, data science notebooks, and any third-party integrations with direct database access. Document the current connection count, peak QPS, and storage volume. This phase always takes longer than expected because undocumented consumers appear.
Compatibility Layer and Dual-Write Implementation
Build the abstraction layer that lets application code work with both databases. Implement feature flags for read/write routing. Set up the dual-write path with error handling that ensures source writes succeed even if target writes fail (the source is still the authority). Deploy connection poolers. This phase includes writing the reconciliation jobs that will validate data consistency throughout the migration.
Data Migration and Validation
Run the initial bulk data copy. For large datasets, this can take days — a billion-row table with complex indexes may take 48-72 hours to fully copy and index. Start continuous replication to keep the target in sync with ongoing writes. Run reconciliation jobs continuously. Fix data issues as they surface: encoding mismatches, truncated values, timezone conversions, and sequences that need resetting. This is typically the longest phase and the one with the most surprises.
Shadow Reads and Performance Validation
Enable shadow reads to compare query results between source and target in real time. Run load tests against the target at production scale. Tune query plans, add missing indexes, and adjust database configuration parameters. Conduct the rollback drill. By the end of this phase, you should have high confidence that the target database handles every query your application issues with acceptable latency.
Gradual Cutover and Cleanup
Shift read traffic in increments: 1%, 5%, 10%, 25%, 50%, 100%. Each step runs for at least a few hours (ideally a full business day) with monitoring. If any step shows regression, pause and investigate before proceeding. Once at 100% reads on the target, keep the source database in sync for one week as a safety net, receiving replicated writes but serving no traffic. Then decommission: remove dual-write code, shut down the source, and archive backups. Run a retrospective documenting what worked, what surprised you, and what you would change for the next migration.