Database health check playbook: monitor, diagnose, optimize
Reactive database management is a losing strategy
Reactive: Wait for users to report slowness before investigating queries.
Proactive: Automated slow query analysis flags regressions before users notice.
Reactive: Discover the disk is full when writes start failing at midnight.
Proactive: Storage growth trending triggers capacity alerts weeks in advance.
Reactive: Add indexes randomly, hoping something improves query speed.
Proactive: Index usage audits identify exactly which indexes to add, drop, or rebuild.
Reactive: Assume backups work because the cron job is configured.
Proactive: Monthly restore drills validate that backups actually produce a working database.
Eight metrics that reveal your database's true health
Query Performance
Slow queries are the canary in the coal mine. Analyze execution plans, identify full table scans, and track p95/p99 query latency over time. A query that runs fine at 100 rows will cripple your system at 10 million.
p99 query latency < 200ms
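As a sketch of what tracking tail latency might look like, here is a minimal nearest-rank percentile check over a window of query timings. The sample values are hypothetical; in practice the window would come from your query log or metrics pipeline.

```python
import math

def percentile(latencies_ms, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

# Hypothetical one-minute window of query latencies in milliseconds.
window = [12, 15, 18, 22, 30, 45, 60, 90, 150, 480]
p95 = percentile(window, 95)
p99 = percentile(window, 99)
alert = p99 > 200  # playbook target: p99 < 200ms
```

Note that a single outlier (the 480ms sample) is enough to breach the p99 target in a small window, which is exactly why tail percentiles catch what averages hide.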
Connection Pool Utilization
Connection exhaustion causes cascading failures that look nothing like a database problem. Monitor active vs idle connections, track pool wait times, and set hard limits before your application starts queuing.
Pool utilization < 80%
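A minimal sketch of the utilization check, assuming your pool exposes active and idle connection counts (most pools, including PgBouncer and HikariCP, do, though the stat names vary):

```python
def pool_utilization(active, idle):
    """Fraction of the pool currently serving queries."""
    total = active + idle
    return active / total if total else 0.0

# Hypothetical snapshot from the pool's stats endpoint.
util = pool_utilization(active=45, idle=5)  # 0.9
saturated = util > 0.80  # playbook threshold: alert before the queue forms
```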
Buffer & Cache Hit Ratios
Every cache miss is a disk read. Your buffer pool should serve 99%+ of reads from memory. A declining hit ratio signals either insufficient memory allocation or a working set that has outgrown your cache.
Cache hit ratio > 99%
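One gotcha worth making concrete: buffer pool counters are cumulative since startup, so a meaningful hit ratio must be computed over an interval from two snapshots. A sketch with hypothetical counter values:

```python
def interval_hit_ratio(prev, curr):
    """Hit ratio over a sampling interval, from two (hits, misses) cumulative snapshots.
    Raw counters accumulate since startup, so always diff two readings."""
    d_hits = curr[0] - prev[0]
    d_misses = curr[1] - prev[1]
    total = d_hits + d_misses
    return d_hits / total if total else 1.0

# Hypothetical counters sampled one minute apart.
ratio = interval_hit_ratio(prev=(1_000_000, 2_000), curr=(1_200_000, 2_500))
healthy = ratio >= 0.99  # playbook target
```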
Disk I/O Latency
When the buffer pool cannot absorb the load, disk I/O becomes the bottleneck. Track read/write IOPS, average latency, and I/O queue depth. Sustained latency above 10ms on SSDs points to either saturation or hardware degradation.
Avg I/O latency < 5ms (SSD)
Replication Lag
Replication lag breaks read-after-write consistency and turns your replicas into liabilities. Monitor lag in seconds, not just bytes. A replica that falls behind during peak load is worse than having no replica at all.
Replication lag < 1 second
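Measuring lag in seconds can be as simple as diffing the replica's last-replayed timestamp against the clock. The timestamps below are hypothetical; on PostgreSQL the replica-side value would come from pg_last_xact_replay_timestamp().

```python
from datetime import datetime, timezone, timedelta

def lag_seconds(last_replayed_at, now=None):
    """Replication lag in wall-clock seconds, the unit users actually feel."""
    now = now or datetime.now(timezone.utc)
    return max(0.0, (now - last_replayed_at).total_seconds())

# Hypothetical values: the replica last replayed a transaction 3 seconds ago.
now = datetime(2024, 6, 1, 12, 0, 3, tzinfo=timezone.utc)
lag = lag_seconds(now - timedelta(seconds=3), now=now)
breached = lag >= 1.0  # playbook target: < 1 second
```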
Lock Contention & Deadlocks
Locks are necessary; contention is not. Track lock wait times, identify long-running transactions holding row or table locks, and investigate deadlock graphs. High contention often reveals a schema or access pattern problem.
Deadlocks < 1 per hour
Index Efficiency
Unused indexes waste write performance and storage. Missing indexes force full table scans. Audit your indexes quarterly: drop what is never read, add what the query planner keeps requesting, and rebuild fragmented ones.
Index hit rate > 95%
Storage Growth Trends
Storage surprises are preventable. Track table sizes, BLOB growth, WAL/binlog accumulation, and temp tablespace usage. Project growth rates and set alerts at 70% capacity so you act before the disk fills at 2 AM.
Storage headroom > 30%
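The projection itself is simple arithmetic. A minimal sketch, assuming you already track a daily growth rate from table-size snapshots (all figures below are illustrative):

```python
def days_until_threshold(used_gb, capacity_gb, daily_growth_gb, threshold=0.70):
    """Days until usage crosses the threshold fraction of capacity.
    Returns 0 if already past it, infinity if usage is flat."""
    remaining = capacity_gb * threshold - used_gb
    if remaining <= 0:
        return 0.0
    if daily_growth_gb <= 0:
        return float("inf")
    return remaining / daily_growth_gb

# Hypothetical volume: 500 GB disk, 290 GB used, growing 2 GB/day.
runway = days_until_threshold(290, 500, 2.0)  # 30 days to the 70% alert line
```

Thirty days of runway is enough time to plan a resize during business hours instead of at 2 AM.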
Health check frequency matrix
Health checks by database engine
PostgreSQL
Strengths
Rich statistics views, pg_stat_statements for query analysis, EXPLAIN ANALYZE with buffer info
Watch For
Table bloat from MVCC, autovacuum tuning, connection limits (default 100), shared_buffers sizing
Key metric: Dead tuple ratio via pg_stat_user_tables
MySQL / MariaDB
Strengths
Performance Schema, sys schema views, slow query log with microsecond precision
Watch For
InnoDB buffer pool size (aim for 70-80% of RAM), binary log disk usage, replication thread lag
Key metric: Innodb_buffer_pool_read_requests vs Innodb_buffer_pool_reads
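Those two counters combine into a hit ratio directly; both come from SHOW GLOBAL STATUS, and like all status counters they are cumulative, so diff snapshots for interval readings. The counter values below are illustrative:

```python
def innodb_hit_ratio(read_requests, reads):
    """Buffer pool hit ratio: read_requests are logical reads,
    reads are the subset that had to go to disk."""
    return 1 - reads / read_requests if read_requests else 1.0

# Hypothetical counters: one million logical reads, 8k disk reads.
ratio = innodb_hit_ratio(read_requests=1_000_000, reads=8_000)  # ~0.992
```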
Oracle
Strengths
Comprehensive wait event model, SQL Tuning Advisor, segment-level statistics
Watch For
SGA sizing, PGA memory pressure, undo tablespace management, archive log destination fill rate
Key metric: DB Time vs elapsed time ratio
SQL Server
Strengths
Query Store for regression detection, DMVs for real-time diagnostics, built-in intelligent tuning
Watch For
TempDB contention, memory grants and RESOURCE_SEMAPHORE waits, statistics staleness, page life expectancy
Key metric: Page Life Expectancy (target > 300 seconds)
MongoDB
Strengths
Built-in profiler, explain() with executionStats, $indexStats aggregation
Watch For
WiredTiger cache pressure, collection scan frequency, shard key cardinality, oplog window size
Key metric: Cache dirty percentage in WiredTiger stats
Alert on these thresholds, not on guesses
Six performance tuning steps that deliver immediate results
Identify Top Offenders
Sort queries by total execution time, not individual duration. A query that runs 50ms but executes 100,000 times per hour consumes more resources than a 10-second report query that runs once daily. Focus optimization where cumulative impact is highest.
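The ranking is a one-liner once you have per-query stats. A sketch over rows shaped like pg_stat_statements output (query text, call count, mean latency); the two example queries are hypothetical:

```python
# Hypothetical rows: (query, calls, mean_ms).
queries = [
    ("SELECT * FROM orders WHERE user_id = $1", 100_000, 50),  # hot OLTP query
    ("SELECT ... FROM daily_report", 1, 10_000),               # nightly report
]

def by_total_time(rows):
    """Rank by cumulative time (calls x mean), not individual duration."""
    return sorted(rows, key=lambda r: r[1] * r[2], reverse=True)

worst = by_total_time(queries)[0]
# The 50ms query costs 5,000 seconds of database time per hour;
# the 10-second report costs 10 seconds per day.
```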
Fix Missing Indexes
Read execution plans, not guesses. Look for sequential scans on filtered columns, sort operations on unindexed fields, and join conditions without supporting indexes. Add composite indexes that match your most common WHERE + ORDER BY patterns.
Optimize Connection Pooling
Stop opening a new connection per request. Use PgBouncer, ProxySQL, or application-level pools with a maximum sized to your database's actual capacity. A PostgreSQL server with 100 max_connections does not need 500 pool connections waiting in line.
Implement Query Result Caching
Cache frequently read, rarely changed data at the application layer with Redis or Memcached. Set TTLs based on data freshness requirements, not arbitrary timeouts. Invalidate on writes, not on timers, for data that must be current.
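A minimal sketch of the read-through pattern with write-path invalidation, using an in-memory dict as a stand-in for Redis or Memcached (the class and key names are illustrative, not a library API):

```python
import time

class ReadThroughCache:
    """Read-through cache: TTL as a safety net, explicit invalidation on writes."""
    def __init__(self, loader, ttl_seconds=60):
        self.loader = loader        # function that actually hits the database
        self.ttl = ttl_seconds
        self.store = {}             # key -> (value, expires_at)

    def get(self, key):
        hit = self.store.get(key)
        if hit and hit[1] > time.monotonic():
            return hit[0]
        value = self.loader(key)
        self.store[key] = (value, time.monotonic() + self.ttl)
        return value

    def invalidate(self, key):
        """Call this from the write path so readers never see stale data."""
        self.store.pop(key, None)

db = {"user:1": "alice"}                          # stands in for the database
cache = ReadThroughCache(db.get, ttl_seconds=60)
first = cache.get("user:1")                       # miss: loads from db
db["user:1"] = "bob"                              # the write path...
cache.invalidate("user:1")                        # ...invalidates, not waits for TTL
second = cache.get("user:1")                      # reloads the fresh value
```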
Right-Size Your Buffers
Allocate 70-80% of available RAM to the database buffer pool on dedicated servers. Monitor hit ratios after changes. Oversized buffers waste memory the OS could use for filesystem cache; undersized buffers cause constant disk reads.
Partition Large Tables
Tables beyond 100 million rows benefit from partitioning by date, region, or tenant. Partition pruning eliminates scanning irrelevant data. Maintenance operations like VACUUM or OPTIMIZE run per-partition, reducing lock duration.
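To make the date-partitioning idea concrete, here is a sketch that generates PostgreSQL-style declarative partition DDL per month. The table and column names are illustrative:

```python
from datetime import date

# Hypothetical parent table, range-partitioned by date.
CREATE_PARENT = """
CREATE TABLE events (
    id bigint,
    created_at date NOT NULL
) PARTITION BY RANGE (created_at);
"""

def month_partition_ddl(year, month):
    """DDL for one monthly partition; partition pruning then lets the
    planner skip every month a query's WHERE clause excludes."""
    start = date(year, month, 1)
    end = date(year + (month == 12), month % 12 + 1, 1)
    return (f"CREATE TABLE events_{start:%Y_%m} PARTITION OF events "
            f"FOR VALUES FROM ('{start}') TO ('{end}');")
```

Generating partitions on a schedule (plus one spare month ahead) avoids the classic failure of inserts landing with no partition to receive them.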
Warning signs that need immediate attention
If any of these conditions persist for more than 15 minutes during business hours, escalate immediately. These are not optimization opportunities — they are active risks to data integrity and service availability.
Query execution times exceeding 5 seconds in OLTP workloads
Replication lag consistently above 30 seconds
Connection pool utilization sustained above 90%
Buffer cache hit ratio dropping below 95%
Disk I/O queue depth exceeding 4 on SSD storage
Deadlock frequency increasing week over week
Storage growing faster than projected capacity runway
Backup jobs failing or skipping scheduled windows
Scale up or scale out: a decision framework
Vertical scaling fits when
- Single-query performance is the bottleneck
- Working set exceeds available memory
- I/O latency is high but throughput is manageable
- Application relies on strong consistency and joins
- Operational complexity needs to stay low
Horizontal scaling fits when
- Read throughput exceeds what one node can serve
- Data volume exceeds single-node storage limits
- Geographic distribution is required for latency
- Write volume demands sharding across nodes
- High availability requires multi-node redundancy
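The two checklists above can be tallied mechanically. A sketch, where the signal names are illustrative labels for the bullets rather than any standard taxonomy:

```python
def scaling_recommendation(signals):
    """Count which checklist the observed conditions match.
    Ties favor scaling up, the operationally simpler option."""
    vertical = {"single_query_bottleneck", "working_set_exceeds_ram",
                "high_io_latency", "needs_strong_consistency", "keep_ops_simple"}
    horizontal = {"read_throughput_exceeded", "data_exceeds_node_storage",
                  "geo_distribution", "write_sharding_needed", "multi_node_ha"}
    v, h = len(signals & vertical), len(signals & horizontal)
    return "scale up" if v >= h else "scale out"

rec = scaling_recommendation({"working_set_exceeds_ram", "needs_strong_consistency"})
```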
Your backups are only as good as your last restore
- Verify backup completion daily (Automated)
- Restore backup to staging monthly (Scheduled)
- Measure actual RTO against SLA quarterly (Drill)
- Test point-in-time recovery accuracy (Quarterly)
- Validate cross-region failover path (Semi-annual)
- Audit backup encryption and access controls (Annual)
A healthy database is not an accident — it's a practice.
The organizations that never have database emergencies are not luckier than everyone else. They run the same health checks every day, every week, every quarter. They know their baseline metrics and notice when trends shift. They test their recovery procedures before they need them. Database health is not a state you achieve — it's a discipline you maintain. Start with the eight vital signs. Build the cadence. Make it boring. Boring databases are reliable databases.