Data Engineering · 9 min read

Database migration playbook: zero-downtime strategies

“We'll schedule a maintenance window” is the phrase that launches a thousand late-night incidents. Modern database migration doesn't require downtime — it requires engineering discipline and the right patterns.
Patterns

Four patterns for zero-downtime migration

Each pattern trades a different resource — latency, disk space, code complexity — to eliminate downtime. Choose based on your constraints.

Dual-Write

Write to both the old and new database simultaneously during migration. The application layer handles consistency while you validate the new store in production conditions.

Higher write latency during migration, but zero read disruption.
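As a minimal sketch of the pattern, the wrapper below (hypothetical `DualWriter` class, with sqlite3 standing in for the two databases) writes to the authoritative source first and treats a target failure as something to record, not something to surface to users:

```python
import sqlite3

class DualWriter:
    """Write to the source (authoritative) and the target; a target
    failure is recorded for later repair, never raised to the caller."""

    def __init__(self, source, target):
        self.source = source
        self.target = target
        self.target_failures = []

    def execute(self, sql, params=()):
        # The source write must succeed; it remains the system of record.
        self.source.execute(sql, params)
        self.source.commit()
        try:
            self.target.execute(sql, params)
            self.target.commit()
        except sqlite3.Error as exc:
            # Log and move on; reconciliation jobs will repair the gap.
            self.target_failures.append((sql, params, str(exc)))

# Demo with two in-memory databases standing in for old and new stores.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
target.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")

writer = DualWriter(source, target)
writer.execute("INSERT INTO users (id, email) VALUES (?, ?)",
               (1, "a@example.com"))
```

The key design choice is the asymmetry: a source failure aborts the request, a target failure only dents a metric.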

Shadow Reads

Read from both databases and compare results without serving the new data to users. Catches schema mismatches, encoding differences, and query performance regressions before cutover.

Doubles read load temporarily, but provides confidence before switching.
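A shadow read can be as simple as the sketch below (hypothetical `shadow_read` helper, sqlite3 in place of the real stores): serve the source result, compare against the target, and file any divergence for offline inspection.

```python
import sqlite3

def shadow_read(source, target, sql, params=(), mismatches=None):
    """Serve the source result to the user; compare against the target
    and record any divergence without affecting the response."""
    src_rows = source.execute(sql, params).fetchall()
    tgt_rows = target.execute(sql, params).fetchall()
    if src_rows != tgt_rows and mismatches is not None:
        mismatches.append({"sql": sql, "source": src_rows, "target": tgt_rows})
    return src_rows  # users always see the source data

source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for db in (source, target):
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    db.execute("INSERT INTO users VALUES (1, 'Ada')")
# Seed a deliberate divergence, e.g. a trailing-whitespace encoding issue.
target.execute("UPDATE users SET name = 'Ada ' WHERE id = 1")

mismatches = []
rows = shadow_read(source, target, "SELECT name FROM users WHERE id = 1",
                   mismatches=mismatches)
```

In production the comparison would run asynchronously so the extra read never adds user-facing latency.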

Expand-Contract Schema

Add new columns or tables alongside existing ones (expand), migrate data, update application code, then remove old structures (contract). No breaking changes at any step.

Requires disciplined multi-phase deployments and backward-compatible code.
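The phases can be seen in miniature below (sqlite3 as a stand-in; the `orders`/`amount_cents` schema is invented for illustration). Each statement is independently deployable and nothing breaks readers of the old column until the final contract:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount_cents INTEGER)")
db.execute("INSERT INTO orders VALUES (1, 1999), (2, 500)")

# Phase 1, expand: add the new column alongside the old one.
db.execute("ALTER TABLE orders ADD COLUMN amount_decimal REAL")

# Phase 2, migrate: backfill the new column from the old.
db.execute("UPDATE orders SET amount_decimal = amount_cents / 100.0")

# Phase 3: application code switches to amount_decimal behind a flag,
# dual-writing both columns in the meantime.

# Phase 4, contract: only after every reader and writer has moved over.
# db.execute("ALTER TABLE orders DROP COLUMN amount_cents")
```

The contract step is deliberately left commented out: it is the one irreversible action, so it runs last and only after validation.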

Online Migration Tooling

Tools like gh-ost and pt-online-schema-change perform schema changes on live MySQL databases by creating shadow tables and replaying changes without locking; pglogical streams changes between PostgreSQL clusters via logical replication.

Needs sufficient disk space for shadow tables and careful monitoring.

Approach Comparison

Choosing your migration strategy

The right approach depends on your downtime tolerance, data volume, and team's operational maturity. Here's how they compare.
| Approach | Downtime | Complexity | Data Volume | Rollback |
| --- | --- | --- | --- | --- |
| Offline Cutover | Hours to days | Low | Any size | Full restore from backup |
| Blue-Green | Seconds | Medium | Moderate | Switch DNS back instantly |
| Rolling Migration | None | High | Large | Per-shard rollback possible |
| Live Migration | None | Very High | Any size | Dual-write allows instant revert |
The Playbook

Six phases of a zero-downtime migration

This isn't a weekend project. A successful zero-downtime migration is a multi-week engineering effort with checkpoints at every stage.
01. Schema Analysis

Map every table, index, constraint, and stored procedure. Identify type mismatches between source and target. Document implicit dependencies that ORMs hide.

02. Compatibility Layer

Build an abstraction that lets application code work with both old and new schemas simultaneously. Feature flags control which database serves reads.

03. Dual-Write Setup

Instrument the write path to persist data to both databases. Use async replication for bulk history, synchronous writes for new data. Monitor write latency closely.

04. Validation & Reconciliation

Run continuous comparison jobs between source and target. Check row counts, checksums, and sampled field values. Fix discrepancies before proceeding.
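A reconciliation job can be reduced to a count-plus-checksum comparison per table. The sketch below uses hypothetical helper names and sqlite3 in-memory databases; a production job would chunk by key range rather than hash whole tables:

```python
import hashlib
import sqlite3

def table_fingerprint(conn, table, key="id"):
    """Row count plus a checksum over rows in key order."""
    rows = conn.execute(f"SELECT * FROM {table} ORDER BY {key}").fetchall()
    digest = hashlib.sha256(repr(rows).encode()).hexdigest()
    return len(rows), digest

def reconcile(source, target, tables):
    """True per table when both count and checksum match."""
    return {t: table_fingerprint(source, t) == table_fingerprint(target, t)
            for t in tables}

source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for db in (source, target):
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    db.execute("INSERT INTO users VALUES (1, 'a@example.com')")
# A silent discrepancy: same row count, different content.
target.execute("UPDATE users SET email = 'A@EXAMPLE.COM' WHERE id = 1")

report = reconcile(source, target, ["users"])
```

Note that the checksum catches what a bare row count misses: here both tables hold one row, yet the report flags the table.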

05. Traffic Cutover

Shift read traffic gradually — 1%, 10%, 50%, 100%. Monitor error rates, latency percentiles, and data consistency at each step. Keep the old database warm.
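One way to make the gradual shift deterministic is hash-based bucketing, sketched below (hypothetical `reads_from_target` function): a user admitted to the target at 10% is still on the target at 50% and 100%, so no one flaps between databases as the rollout advances.

```python
import zlib

def reads_from_target(user_id: str, rollout_percent: int) -> bool:
    """Hash the user into one of 100 stable buckets; the set of users
    on the target only ever grows as the percentage increases."""
    return zlib.crc32(user_id.encode()) % 100 < rollout_percent

# Ramp schedule from the playbook; monitoring gates each step.
for pct in (1, 10, 50, 100):
    on_target = sum(reads_from_target(f"user-{i}", pct)
                    for i in range(10_000))
    # error-rate, latency, and consistency checks would run here
```

The same predicate can back a feature flag, which keeps the rollback path trivial: set the percentage back to zero.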

06. Cleanup

Remove dual-write code, decommission the old database, and archive the compatibility layer. Update documentation and runbooks. Run a retrospective.

Track Record

Migration by the numbers

2.4B records migrated in our largest zero-downtime database migration project
0 sec of user-facing downtime achieved using dual-write with phased cutover
99.9997% data consistency verified through continuous reconciliation jobs
< 3 ms additional write latency during the dual-write phase
Tooling

Tools we recommend for database migration

No single tool covers every migration scenario. The right choice depends on your source database, target engine, and whether you need online schema changes or full data migration.

pglogical

Logical replication for PostgreSQL: migrate between major versions or to new clusters without downtime by streaming changes from a provider node to a subscriber.

Best for: PostgreSQL version upgrades and cross-cluster migration

gh-ost

GitHub's online schema migration for MySQL. Creates shadow tables, copies data, replays binlog changes, and performs an atomic cutover.

Best for: MySQL schema changes on large tables in production

pt-online-schema-change

Percona's tool for ALTER TABLE operations without locking. Uses triggers to capture changes during migration.

Best for: MySQL environments where trigger-based migration is acceptable

AWS DMS

Database Migration Service handles heterogeneous migrations — Oracle to PostgreSQL, SQL Server to Aurora, and more with change data capture.

Best for: Cross-engine migrations and cloud database adoption

Flyway

Version-controlled schema migrations with a migration-per-file approach. Tracks applied migrations and ensures repeatable deployments.

Best for: Schema version control and CI/CD integration

Liquibase

Database-agnostic schema change management with rollback support, diff generation, and preconditions for safe deployments.

Best for: Multi-database environments needing vendor-neutral migration tooling

The Standard

Zero downtime isn't a luxury — it's an engineering choice.

Every maintenance window is a decision to accept risk. Every “we'll do it Saturday night” is a bet that nothing will go wrong under pressure and fatigue. Zero-downtime migration takes more upfront engineering, but it eliminates the late-night gambles and gives you the ability to roll back at any point. That's not perfectionism — that's professionalism.

Connection Management

Connection pooling during migration

Database migrations add connection management complexity that most teams underestimate. During a dual-write phase, your application maintains connections to two databases simultaneously, effectively doubling your connection footprint. Without proper pooling, you will exhaust connection limits and create the very outage you are trying to avoid.

External Connection Poolers

PgBouncer (for PostgreSQL) and ProxySQL (for MySQL) sit between your application and the database, multiplexing hundreds of application connections into a smaller pool of actual database connections. During migration, deploy a pooler in front of both the source and target databases. Configure the pooler in transaction mode, not session mode, so connections are returned to the pool after each transaction rather than being held for the lifetime of the application connection. This is critical during dual-write, where each application request touches two databases and you cannot afford to hold two connections idle between requests.

For cloud-managed databases, use the provider's built-in options: Amazon RDS Proxy, the built-in PgBouncer on Azure Database for PostgreSQL Flexible Server, or the Cloud SQL Auth Proxy on Google Cloud. These handle TLS termination and IAM authentication (and, in the case of RDS Proxy and PgBouncer, connection multiplexing), reducing operational surface area during an already complex migration.

Application-Level Pool Configuration

Your ORM or database driver has its own connection pool settings. During migration, reduce the maximum pool size for each database to half your normal limit. If your application normally uses a pool of 20 connections and you are dual-writing, configure 10 connections to the source and 10 to the target. Monitor pool wait times — if application threads are blocking on pool acquisition, you need to either increase pool limits (if the database can handle it) or optimize your queries to hold connections for shorter durations.

Set aggressive connection timeouts during migration. A connection checkout timeout of 5 seconds (rather than the default 30 seconds in many ORMs) ensures you fail fast and surface pool exhaustion as an error rather than a slow degradation that is hard to diagnose. Pair this with circuit breakers on the write path so a slow target database does not back up the connection pool and take down reads from the source.
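The fail-fast behavior can be sketched in a few lines. This is a hypothetical minimal pool, not a replacement for your driver's pooling; it only illustrates the halved-budget sizing and the short checkout timeout:

```python
import queue
import sqlite3

class TimedPool:
    """Minimal connection pool with a fail-fast checkout timeout."""

    def __init__(self, factory, size, checkout_timeout=5.0):
        self._free = queue.Queue()
        self.checkout_timeout = checkout_timeout
        for _ in range(size):
            self._free.put(factory())

    def acquire(self):
        try:
            return self._free.get(timeout=self.checkout_timeout)
        except queue.Empty:
            # Surface exhaustion as an error instead of a slow degradation.
            raise TimeoutError("pool exhausted: failing fast")

    def release(self, conn):
        self._free.put(conn)

# Dual-write halves the per-database budget: 20 total becomes 10 + 10.
NORMAL_LIMIT = 20
source_pool = TimedPool(lambda: sqlite3.connect(":memory:"),
                        NORMAL_LIMIT // 2, checkout_timeout=0.1)
target_pool = TimedPool(lambda: sqlite3.connect(":memory:"),
                        NORMAL_LIMIT // 2, checkout_timeout=0.1)
```

With a real ORM the equivalent is configuration, not code: halve the max pool size per database and drop the checkout timeout from its default to a few seconds.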

Observability

Monitoring and alerting during cutover

The cutover phase — when you shift read traffic from source to target — is where migrations succeed or fail. You need real-time visibility into both databases and the application layer to make confident decisions about proceeding, pausing, or rolling back.

Replication Lag Monitoring

If you are using change data capture or logical replication to keep the target in sync, replication lag is your most important metric. Set up a heartbeat table that writes a timestamp to the source every second and measures when it appears on the target. Alert if lag exceeds your tolerance (typically 1-5 seconds for online migration). Do not proceed with cutover until lag is consistently below your threshold for at least 30 minutes.
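The heartbeat mechanism is small enough to show in full. In this sketch sqlite3 stands in for both databases and a manual row copy stands in for logical replication; the table and function names are illustrative:

```python
import sqlite3
import time

source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for db in (source, target):
    db.execute("CREATE TABLE heartbeat (ts REAL)")

def beat(conn):
    """Write the current timestamp; a cron would do this every second."""
    conn.execute("INSERT INTO heartbeat (ts) VALUES (?)", (time.time(),))

def lag_seconds(conn):
    """Age of the newest heartbeat visible on this database."""
    newest = conn.execute("SELECT MAX(ts) FROM heartbeat").fetchone()[0]
    return None if newest is None else time.time() - newest

beat(source)
# Stand-in for logical replication: copy the heartbeat row across.
row = source.execute("SELECT ts FROM heartbeat").fetchone()
target.execute("INSERT INTO heartbeat (ts) VALUES (?)", row)

lag = lag_seconds(target)  # alert when this exceeds your tolerance
```

The virtue of a heartbeat table over replication-internal metrics is that it measures end-to-end visibility, which is what cutover actually depends on.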

Query Latency Percentiles

Track P50, P95, and P99 latency for both read and write queries against the target database before shifting production traffic. Compare these percentiles against the source database baseline. A P50 that matches but a P99 that is 10x higher indicates missing indexes or suboptimal query plans on the target. During traffic shift, monitor these percentiles per percentage step — if P99 degrades at 10% traffic, do not proceed to 50%.
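Computing the percentiles is one line with the standard library; the sketch below (hypothetical helper, synthetic latency samples) shows the signature to watch for, a matching P50 with a blown-out P99:

```python
import statistics

def latency_percentiles(samples_ms):
    """P50/P95/P99 from a list of latency samples (needs many samples)."""
    cuts = statistics.quantiles(samples_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Synthetic data: both databases look identical at the median, but the
# target's tail is an order of magnitude worse, the classic missing-index
# or bad-query-plan signature.
source_p = latency_percentiles([5] * 99 + [20])
target_p = latency_percentiles([5] * 99 + [200])
```

Comparing per-percentile rather than on averages is the whole point: a mean would barely move here.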

Error Rate by Query Type

Aggregate error rates hide critical issues. A 0.1% overall error rate during cutover sounds acceptable until you discover that 100% of a specific query type is failing because of a schema difference. Break error monitoring down by query category: reads vs writes, by table, and by application endpoint. Set per-category alert thresholds that are tighter than your normal production alerting. During migration, a 0.01% error rate on writes should trigger investigation.
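The bookkeeping for per-category rates is straightforward; this hypothetical `ErrorMonitor` sketch reproduces the exact failure mode described above, a 1% aggregate rate hiding a 100% failure on one write path:

```python
from collections import defaultdict

class ErrorMonitor:
    """Track error rates per (operation, table) instead of one global rate."""

    def __init__(self):
        self.totals = defaultdict(int)
        self.errors = defaultdict(int)

    def record(self, op, table, ok):
        self.totals[(op, table)] += 1
        if not ok:
            self.errors[(op, table)] += 1

    def breached(self, threshold=0.0001):
        """Categories whose error rate exceeds the migration threshold."""
        return sorted(k for k, n in self.totals.items()
                      if self.errors[k] / n > threshold)

mon = ErrorMonitor()
for _ in range(9_900):                 # reads are healthy
    mon.record("read", "users", ok=True)
for _ in range(100):                   # every write to one table fails
    mon.record("write", "orders", ok=False)

# The aggregate rate looks survivable; the breakdown says otherwise.
overall = sum(mon.errors.values()) / sum(mon.totals.values())
```

In practice the categories would come from tagged metrics in your observability stack rather than an in-process counter, but the principle of per-category thresholds is the same.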

Data Consistency Dashboard

Build a dedicated dashboard that runs continuous row-count comparisons and checksum validations between source and target. Display: total row counts per table, delta since last check, checksum mismatches by table, and the timestamp of the last successful full reconciliation. This dashboard should be visible to the entire migration team during cutover — ideally on a shared screen or war room display. Any divergence between source and target is a reason to pause the cutover.

Validation

Testing strategies for migration validation

Migrations fail not because the data transfer is wrong, but because the application behaves differently against the new database in ways that unit tests do not cover. Here are the testing layers that catch real migration issues.

Schema Compatibility Tests

Write automated tests that compare the source and target schemas column by column. Check data types, nullable constraints, default values, indexes, and foreign keys. A VARCHAR(255) on MySQL becoming a TEXT on PostgreSQL changes storage behavior and index capabilities. A DATETIME column losing timezone information during migration will corrupt timestamps silently. These tests should run in CI against both databases and fail the build if schemas diverge in unexpected ways.
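A column-by-column comparison can be automated in a few lines. The sketch below uses sqlite3's `PRAGMA table_info` for both sides; a real cross-engine check would query `information_schema` on each database instead, but the diffing logic is the same:

```python
import sqlite3

def columns(conn, table):
    """Map column name -> (declared type, NOT NULL flag, default value)."""
    return {row[1]: (row[2], row[3], row[4])
            for row in conn.execute(f"PRAGMA table_info({table})")}

def schema_diff(source, target, table):
    """Columns that exist on only one side or differ in definition."""
    src, tgt = columns(source, table), columns(target, table)
    return [(name, src.get(name), tgt.get(name))
            for name in sorted(set(src) | set(tgt))
            if src.get(name) != tgt.get(name)]

source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
target.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")

diffs = schema_diff(source, target, "users")  # email lost its NOT NULL
```

Run this per table in CI and fail the build on any non-empty diff that isn't explicitly whitelisted.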

Shadow Traffic Replay

Capture a sample of production queries (reads and writes) from your application logs or query audit log. Replay them against the target database and compare results. For reads, compare result sets row by row. For writes, compare the resulting database state. This catches issues that synthetic tests miss: queries that rely on implicit type casting, collation differences, or database-specific SQL extensions. Tools like pt-query-digest (Percona) can capture and replay MySQL queries. For PostgreSQL, pgreplay can replay statement logs against a test server, and pgBadger helps analyze the logs to choose a representative sample.

Load Testing at Production Scale

Restore a production-sized dataset to the target database and run load tests that simulate your peak traffic pattern. Do not skip this step because the target database is “the same engine, just a newer version.” Query planner behavior changes between versions. Index selection heuristics change. Connection handling changes. A migration that performs perfectly with 1 million rows and 10 concurrent connections may fail catastrophically with 500 million rows and 200 connections because the query planner chooses a sequential scan instead of an index lookup at scale.

Rollback Drill

Practice the rollback procedure before you need it. Simulate a failed cutover: shift 10% of traffic to the target, inject a failure condition, and execute the rollback plan. Measure how long it takes and whether any data is lost during the rollback. If your rollback involves switching a feature flag, verify the flag propagation time. If it involves DNS changes, measure TTL-related lag. A rollback plan that has never been tested is not a plan — it is a hope.

Timeline

Real-world timeline for a typical database migration

Teams consistently underestimate how long zero-downtime migrations take. Here is a realistic timeline for migrating a production database with 100 million to 1 billion rows, based on projects we have executed. Adjust durations based on your schema complexity and team size.
Week 1-2

Discovery and Schema Analysis

Map the complete schema including stored procedures, triggers, views, and implicit dependencies. Identify every application that connects to the database — not just your primary application, but reporting tools, cron jobs, data science notebooks, and any third-party integrations with direct database access. Document the current connection count, peak QPS, and storage volume. This phase always takes longer than expected because undocumented consumers appear.

Week 3-4

Compatibility Layer and Dual-Write Implementation

Build the abstraction layer that lets application code work with both databases. Implement feature flags for read/write routing. Set up the dual-write path with error handling that ensures source writes succeed even if target writes fail (the source is still the authority). Deploy connection poolers. This phase includes writing the reconciliation jobs that will validate data consistency throughout the migration.

Week 5-7

Data Migration and Validation

Run the initial bulk data copy. For large datasets, this can take days — a billion-row table with complex indexes may take 48-72 hours to fully copy and index. Start continuous replication to keep the target in sync with ongoing writes. Run reconciliation jobs continuously. Fix data issues as they surface: encoding mismatches, truncated values, timezone conversions, and sequences that need resetting. This is typically the longest phase and the one with the most surprises.

Week 8-9

Shadow Reads and Performance Validation

Enable shadow reads to compare query results between source and target in real time. Run load tests against the target at production scale. Tune query plans, add missing indexes, and adjust database configuration parameters. Conduct the rollback drill. By the end of this phase, you should have high confidence that the target database handles every query your application issues with acceptable latency.

Week 10

Gradual Cutover and Cleanup

Shift read traffic in increments: 1%, 5%, 10%, 25%, 50%, 100%. Each step runs for at least a few hours (ideally a full business day) with monitoring. If any step shows regression, pause and investigate before proceeding. Once at 100% reads on the target, keep the source database running in receive-only mode for one week as a safety net. Then decommission: remove dual-write code, shut down the source, and archive backups. Run a retrospective documenting what worked, what surprised you, and what you would change for the next migration.

Planning a database migration?

We've migrated billions of records across PostgreSQL, MySQL, MongoDB, and cloud-native databases — all without downtime. Let's plan yours.