Engineering change management: deploying software safely
Four types of changes and how to handle each
Standard
Routine, low-risk changes that follow a pre-approved process. Deployments of well-tested features during business hours with automated rollback.
e.g. Config updates, dependency bumps, UI copy changes
Normal
Changes that require review and approval before execution. New functionality, schema migrations, or infrastructure modifications.
e.g. New API endpoints, database migrations, service scaling
Emergency
Unplanned changes needed to restore service or patch critical vulnerabilities. Bypasses standard approval but requires post-implementation review.
e.g. Security patches, outage fixes, data corruption repairs
Pre-Approved
Repeatable changes with proven safety records. Automated pipeline handles validation, deployment, and verification without manual gates.
e.g. Automated scaling, certificate renewals, log rotation
Five steps from request to resolution
Request & Classify
Every change starts with a request. Classify it by type, urgency, and blast radius. This determines the approval path and deployment window.
Impact Assessment
Map affected services, data flows, and downstream consumers. Identify who needs to know and what could break if the change goes wrong.
Review & Approve
Peer review for code. Architecture review for infrastructure. Change advisory board for cross-team impacts. Match the rigor to the risk.
Deploy & Verify
Execute the change using automated pipelines. Run smoke tests, validate metrics, and confirm service health before declaring success.
Close & Learn
Document what happened. If something went wrong, run a blameless retrospective. Feed lessons back into the process to prevent recurrence.
Matching deployment strategy to change risk
Deployment patterns that reduce blast radius
Feature flags
Decouple deployment from release. Ship code to production behind a flag, enable it gradually, and kill it instantly if something breaks.
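The flag mechanics can be sketched in a few lines. This is a minimal in-process illustration, not a production flag system (real deployments typically use a service like LaunchDarkly or Unleash with per-user targeting); the flag name and percentages are hypothetical.

```python
import hashlib

class FeatureFlags:
    """Toy percentage-rollout flag store: ship code dark, enable gradually,
    kill instantly by setting the rollout back to zero."""

    def __init__(self):
        # flag name -> rollout percentage (0 disables, 100 fully enables)
        self._rollout = {}

    def set_rollout(self, flag, percent):
        self._rollout[flag] = percent

    def is_enabled(self, flag, user_id):
        percent = self._rollout.get(flag, 0)
        if percent <= 0:
            return False
        # Hash flag+user so each user lands in a stable bucket 0..99;
        # the same user always gets the same answer at a given percentage.
        digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < percent

flags = FeatureFlags()
flags.set_rollout("new_checkout", 10)   # enable for ~10% of users
flags.set_rollout("new_checkout", 0)    # kill switch: off for everyone
```

Hashing the user id (rather than random sampling per request) is what makes the rollout stable: a user who sees the new behavior keeps seeing it as the percentage grows.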
Canary releases
Route a small percentage of traffic to the new version. Monitor error rates and latency. Promote to full traffic only when metrics are healthy.
Blue-green deployments
Maintain two identical production environments. Deploy to the inactive one, validate, then switch traffic. Rollback is a DNS change.
Rollback procedures
Every deployment must have a documented rollback plan that has been tested before execution. If rollback would take longer than five minutes, the change needs more preparation.
Progressive delivery
Combine feature flags with canary analysis to automate promotion decisions. Let the system decide when a change is safe to go wide.
Immutable artifacts
Build once, deploy everywhere. The same container image runs in dev, staging, and production. No configuration drift, no snowflake servers.
Running a change advisory board that engineers respect
Replace approval gates with risk-based routing
The traditional CAB model — where every change waits for a weekly meeting — is incompatible with continuous delivery. Instead, implement risk-based routing. Standard and pre-approved changes flow through automated pipelines with no human gate. Normal changes require peer review and a lightweight async approval from the on-call lead. Only high-risk changes — those affecting shared infrastructure, database schemas, or cross-team integrations — go to a synchronous CAB review. This typically means fewer than 10% of changes need CAB involvement, and those are the ones that actually benefit from collective scrutiny.
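The routing logic above is simple enough to express directly. A sketch, with illustrative path names and risk-area labels (adapt both to your own taxonomy):

```python
# Areas whose changes always warrant synchronous CAB review (illustrative).
HIGH_RISK_AREAS = {"shared-infra", "db-schema", "cross-team-api"}

def route_change(change_type, touched_areas):
    """Return the approval path for a change request."""
    if change_type in ("standard", "pre-approved"):
        return "automated-pipeline"            # no human gate
    if set(touched_areas) & HIGH_RISK_AREAS:
        return "cab-review"                    # synchronous collective scrutiny
    return "peer-review+on-call-approval"      # lightweight async approval
```

For example, `route_change("normal", ["db-schema"])` routes to CAB review, while the same change type touching only a single service's UI goes through the async path.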
Structure the CAB for speed, not ceremony
An effective engineering CAB meets daily for 15 minutes, not weekly for two hours. Attendees include: the change owner (who presents), the on-call engineer (who assesses production impact), and a senior engineer from any downstream team affected by the change. The format is standardized: blast radius, rollback plan, verification criteria, deployment window. If the change owner cannot articulate all four in under three minutes, the change is not ready. Skip the PowerPoint — a structured Slack thread or Confluence template with pre-filled fields works better than a meeting for most reviews.
Measure CAB effectiveness, not just throughput
Track three metrics for your CAB: the percentage of changes that pass through without modification (if it is above 95%, the CAB is not adding value), the number of production incidents caused by CAB-approved changes (this should trend toward zero), and the average time from change submission to approval (this should be under four hours for normal changes). If your CAB is rejecting nothing and catching nothing, it is security theater. If it is rejecting everything and slowing delivery, it is a bottleneck. Tune the risk classification criteria until the CAB reviews only the changes where human judgment genuinely matters.
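The three metrics can be computed from whatever review records your change tracker exports. A sketch, assuming hypothetical field names on each record:

```python
def cab_health(reviews):
    """Summarize CAB effectiveness from a list of review records.
    Field names (approved_without_changes, caused_incident,
    hours_to_approval) are assumptions about your tracker's export."""
    n = len(reviews)
    pass_rate = sum(r["approved_without_changes"] for r in reviews) / n
    incidents = sum(r["caused_incident"] for r in reviews)
    avg_hours = sum(r["hours_to_approval"] for r in reviews) / n

    warnings = []
    if pass_rate > 0.95:
        warnings.append("CAB is rubber-stamping; tighten risk criteria")
    if avg_hours > 4:
        warnings.append("approval latency exceeds 4h target for normal changes")
    return {"pass_rate": pass_rate, "incidents": incidents,
            "avg_hours": avg_hours, "warnings": warnings}
```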
Automated change impact analysis
Build a dependency graph from your CI pipeline
Your CI system already knows which services depend on which. Extract this information into a queryable dependency graph. When a PR is opened, automatically identify every downstream service that consumes the changed API, every shared library that includes the modified module, and every deployment pipeline that would be triggered. Present this as a comment on the PR: “This change affects: Order Service, Payment Gateway, Notification Worker. Last incident involving these services: 14 days ago.” This gives reviewers context that would otherwise require tribal knowledge.
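Once the graph is extracted, computing the blast radius is a breadth-first walk. A sketch, using a hand-written adjacency map where a real pipeline would load edges from CI metadata (service names are hypothetical):

```python
from collections import deque

# Assumed export from CI metadata: service -> its direct downstream consumers.
CONSUMERS = {
    "order-service": ["payment-gateway", "notification-worker"],
    "payment-gateway": ["ledger-service"],
}

def blast_radius(changed_service):
    """All downstream services affected by a change, transitively."""
    seen, queue = set(), deque([changed_service])
    while queue:
        for consumer in CONSUMERS.get(queue.popleft(), []):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return sorted(seen)

# Rendered as the PR-comment line described above:
affected = blast_radius("order-service")
comment = f"This change affects: {', '.join(affected)}"
```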
Static analysis for configuration changes
Configuration changes are among the most common causes of production incidents, yet they often bypass code review entirely. Treat infrastructure-as-code, feature flag definitions, environment variables, and Kubernetes manifests with the same rigor as application code. Run Terraform plan, Helm diff, or Kubernetes dry-run in CI and post the output as a PR comment. Flag any change that modifies resource limits, networking rules, or IAM permissions for mandatory review. The diff between “what the config says now” and “what it will say after this change” should be visible to every reviewer without running anything locally.
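The "flag for mandatory review" step can be a small script over Terraform's machine-readable plan (produced by terraform show -json plan.out). A sketch: the risky resource-type prefixes are an assumption to adapt to your provider and policy, and only the fields used here (resource_changes, address, type, change.actions) are read.

```python
import json

# Resource-type prefixes that should always trigger human review (assumed
# policy; AWS-flavored examples).
RISKY_TYPES = ("aws_iam", "aws_security_group", "aws_network_acl")

def flag_risky_changes(plan_json):
    """Return human-readable descriptions of plan entries that touch
    IAM or networking resources and are not no-ops."""
    plan = json.loads(plan_json)
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if actions == ["no-op"]:
            continue
        if rc["type"].startswith(RISKY_TYPES):  # startswith accepts a tuple
            flagged.append(f"{rc['address']}: {'/'.join(actions)}")
    return flagged
```

In CI, a non-empty return value would add a mandatory-review label or block auto-merge.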
Risk scoring with historical incident data
Build a simple risk scoring model that combines: the number of files changed, the blast radius (how many services are affected), the historical incident rate for the affected components, and whether the change includes a database migration. Weight these factors based on your organization's actual incident history. A score above a threshold triggers additional review gates automatically. Over time, calibrate the weights by comparing predicted risk scores with actual incident outcomes. This turns change management from a subjective judgment call into a data-driven process.
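A minimal version of such a scoring model, with placeholder weights and threshold that would need calibration against your own incident history:

```python
# Placeholder weights: blast radius and incident history dominate,
# raw file count matters least. Calibrate against real outcomes.
WEIGHTS = {"files": 0.1, "blast": 1.0, "incident_rate": 5.0, "migration": 3.0}
REVIEW_THRESHOLD = 5.0  # assumed cutoff for extra review gates

def risk_score(files_changed, services_affected, incident_rate, has_migration):
    """Combine the four factors from the text into a single score.
    incident_rate is the historical rate (0..1) for affected components."""
    return (WEIGHTS["files"] * files_changed
            + WEIGHTS["blast"] * services_affected
            + WEIGHTS["incident_rate"] * incident_rate
            + WEIGHTS["migration"] * has_migration)

def needs_extra_review(score):
    return score >= REVIEW_THRESHOLD
```

A three-file change to one stable service scores low and sails through; a wide change with a migration touching incident-prone components crosses the threshold and picks up extra gates automatically.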
Database schema change procedures that prevent outages
The expand-contract pattern for zero-downtime migrations
Never rename or remove a column in a single deployment. Instead, use a three-phase approach. Phase one (expand): add the new column alongside the old one, deploy application code that writes to both columns but reads from the old one. Phase two (migrate): backfill the new column with data from the old one, then switch reads to the new column. Phase three (contract): once all application versions in production read from the new column, remove the old column. Each phase is a separate deployment that can be rolled back independently. This takes longer than a single ALTER TABLE, but it eliminates downtime and data loss risk entirely.
Migration review checklist
Before approving any database migration, verify: Does the migration acquire an exclusive lock on a high-traffic table? If so, can it complete within the lock timeout threshold? Has the migration been tested against a production-sized dataset, not just a dev database with 100 rows? Is the migration backward-compatible with the currently running application version? Is there a corresponding rollback migration that has been tested? For Postgres, check that CREATE INDEX uses CONCURRENTLY. For MySQL, verify that ALTER TABLE on large tables uses pt-online-schema-change or gh-ost. These are not optional safety measures — they are the difference between a routine deployment and a 3 AM incident.
Separate schema deployments from application deployments
Run database migrations as a distinct pipeline step that executes before the application deployment. This creates a clear separation of concerns and allows you to verify that the schema change succeeded before any application code depends on it. If the migration fails, the application deployment is automatically blocked. Use a migration tool that supports idempotent operations — Flyway, Liquibase, or Alembic — so that retrying a failed migration does not create duplicate schema objects. Tag each migration with the PR number and deployment ID for full traceability.
Post-deployment verification that catches what tests miss
The five-minute smoke test window
Immediately after deployment, run a suite of critical-path smoke tests against production. These are not your full test suite — they are the 10 to 15 scenarios that, if broken, mean the deployment must be rolled back immediately. Examples: can a user log in, can the checkout flow complete, does the API return 200 on the health endpoint, are background jobs being enqueued. If any smoke test fails, trigger an automatic rollback. The entire window should complete in under five minutes. If your smoke tests take longer, you have too many of them — move the slower ones to a post-deployment monitoring phase.
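The runner for such a suite is deliberately small. A sketch: the checks here are stand-in lambdas where a real runner would make HTTP calls with short timeouts, and the check names mirror the examples above.

```python
def run_smoke_tests(checks):
    """Run each (name, check) pair; return (passed, failures).
    Any failure means the deployment is rolled back automatically."""
    failures = [name for name, check in checks if not check()]
    return (not failures, failures)

checks = [
    ("health endpoint returns 200", lambda: True),  # stand-in for an HTTP call
    ("user can log in", lambda: True),
    ("checkout flow completes", lambda: True),
    ("background jobs enqueued", lambda: True),
]
ok, failed = run_smoke_tests(checks)
action = "promote" if ok else f"rollback ({', '.join(failed)})"
```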
Metric-based deployment gates
Define quantitative thresholds that must hold for 15 minutes after deployment before the release is considered stable. Error rate must stay below 0.1% (or whatever your baseline is). P99 latency must not increase by more than 20%. CPU and memory utilization must remain within normal bands. Queue depth for background jobs must not grow unbounded. Wire these checks into your deployment pipeline so that a canary promotion or blue-green switch only happens when the metrics confirm health. Tools like Argo Rollouts, Flagger, and Spinnaker can automate these gates natively.
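The gate itself reduces to a comparison between baseline and post-deploy metrics. A sketch using the thresholds from the text (0.1% error rate, 20% p99 growth); the queue-depth check uses a doubling heuristic as an assumed proxy for "not growing unbounded":

```python
def gate_passes(baseline, current):
    """True only if post-deploy metrics stay within the stability thresholds.
    Both arguments are dicts with error_rate, p99_ms, and queue_depth."""
    if current["error_rate"] > 0.001:                       # error rate <= 0.1%
        return False
    if current["p99_ms"] > baseline["p99_ms"] * 1.20:       # p99 growth <= 20%
        return False
    if current["queue_depth"] > baseline["queue_depth"] * 2:  # assumed bound
        return False
    return True
```

A pipeline would sample `current` repeatedly over the 15-minute window and only promote the canary (or switch blue-green traffic) if every sample passes.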
The post-deployment communication protocol
Every deployment should produce a brief notification in a shared channel (Slack, Teams, or your ChatOps tool of choice) that includes: what changed (link to the PR or changelog), who deployed it, the verification status (smoke tests passed, metrics within threshold), and the rollback command if needed. This gives the entire engineering organization visibility into what is changing in production at any given moment. When an incident occurs, the first question is always “what changed recently?” — this notification stream answers it instantly. For high-risk changes, add a 30-minute soak period where the deployer actively monitors dashboards before marking the deployment as complete.
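The notification's four fields make it easy to generate mechanically. A sketch that formats the message body; posting it (for example via a chat webhook) is omitted, and all argument values are hypothetical:

```python
def deploy_notification(pr_url, deployer, smoke_ok, metrics_ok, rollback_cmd):
    """Format the post-deploy message: what changed, who, verification
    status, and the rollback command."""
    status = "verified" if (smoke_ok and metrics_ok) else "NEEDS ATTENTION"
    return "\n".join([
        f"Deployed: {pr_url}",
        f"By: {deployer}",
        f"Verification: smoke={'pass' if smoke_ok else 'fail'}, "
        f"metrics={'pass' if metrics_ok else 'fail'} -> {status}",
        f"Rollback: {rollback_cmd}",
    ])
```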
The goal isn't zero changes — it's zero surprises.
High-performing teams deploy more frequently, not less. The difference is that every change is classified, reviewed, and deployed through a process designed to catch problems before users do. Speed and safety aren't opposites — they're the same discipline applied consistently.