Data Engineering · 10 min read

Database health check playbook: monitor, diagnose, optimize

Most teams only look at their database when something breaks. By then the slow queries have been accumulating for weeks, the connection pool has been silently maxing out, and the storage runway is shorter than anyone realized. This playbook gives you a systematic approach to database health — so you catch problems in metrics, not in incident channels.
The Case for Proactive Checks

Reactive database management is a losing strategy

A database that “seems fine” is not a healthy database. It is an uninspected database. The difference between teams that experience routine outages and teams that run stable data infrastructure comes down to one practice: scheduled, disciplined health checks with defined thresholds and accountable owners.

  • Reactive: wait for users to report slowness before investigating queries. Proactive: automated slow query analysis flags regressions before users notice.
  • Reactive: discover the disk is full when writes start failing at midnight. Proactive: storage growth trending triggers capacity alerts weeks in advance.
  • Reactive: add indexes randomly, hoping something improves query speed. Proactive: index usage audits identify exactly which indexes to add, drop, or rebuild.
  • Reactive: assume backups work because the cron job is configured. Proactive: monthly restore drills validate that backups actually produce a working database.

Vital Signs

Eight metrics that reveal your database's true health

These are not vanity metrics. Each one maps directly to a failure mode that will eventually surface in production if left unmonitored. Track all eight, set thresholds for each, and review trends weekly.

Query Performance

Slow queries are the canary in the coal mine. Analyze execution plans, identify full table scans, and track p95/p99 query latency over time. A query that runs fine at 100 rows will cripple your system at 10 million.

p99 query latency < 200ms
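The percentile math behind that threshold is simple enough to sketch. The following is a minimal illustration, not a real monitoring stack's API: the nearest-rank method, the helper name, and the sample numbers are all ours.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # nearest-rank method
    return ordered[rank - 1]

# 100 recent query timings: mostly fast, with a slow tail
latencies = [12] * 90 + [180] * 9 + [950]
p95 = percentile(latencies, 95)   # 180 (ms)
p99 = percentile(latencies, 99)   # 180 (ms)
breach = p99 > 200                # False: under the playbook threshold
```

Note that the p99 here hides the single 950 ms outlier, which is exactly why trends over time matter more than any one snapshot.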

Connection Pool Utilization

Connection exhaustion causes cascading failures that look nothing like a database problem. Monitor active vs idle connections, track pool wait times, and set hard limits before your application starts queuing.

Pool utilization < 80%

Buffer & Cache Hit Ratios

Every cache miss is a disk read. Your buffer pool should serve 99%+ of reads from memory. A declining hit ratio signals either insufficient memory allocation or a working set that has outgrown your cache.

Cache hit ratio > 99%
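The ratio itself is simple arithmetic; here is a sketch. In PostgreSQL the inputs would come from `pg_stat_database`'s `blks_hit` and `blks_read` columns; the function name and threshold check are illustrative.

```python
def cache_hit_ratio(hits: int, disk_reads: int) -> float:
    """Fraction of block requests served from the buffer cache."""
    total = hits + disk_reads
    return hits / total if total else 1.0  # no traffic yet: treat as healthy

# 980k buffer hits vs 20k disk reads -> 98%, below the 99% target
ratio = cache_hit_ratio(980_000, 20_000)  # 0.98
needs_attention = ratio < 0.99            # True
```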

Disk I/O Latency

When the buffer pool cannot absorb the load, disk I/O becomes the bottleneck. Track read/write IOPS, average latency, and I/O queue depth. Sustained latency above 10ms on SSDs points to either saturation or hardware degradation.

Avg I/O latency < 5ms (SSD)

Replication Lag

Replication lag breaks read-after-write consistency and turns your replicas into liabilities. Monitor lag in seconds, not just bytes. A replica that falls behind during peak load is worse than having no replica at all.

Replication lag < 1 second

Lock Contention & Deadlocks

Locks are necessary; contention is not. Track lock wait times, identify long-running transactions holding row or table locks, and investigate deadlock graphs. High contention often reveals a schema or access pattern problem.

Deadlocks < 1 per hour

Index Efficiency

Unused indexes waste write performance and storage. Missing indexes force full table scans. Audit your indexes quarterly: drop what is never read, add what the query planner keeps requesting, and rebuild fragmented ones.

Index hit rate > 95%

Storage Growth Trends

Storage surprises are preventable. Track table sizes, BLOB growth, WAL/binlog accumulation, and temp tablespace usage. Project growth rates and set alerts at 70% capacity so you act before the disk fills at 2 AM.

Storage headroom > 30%
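The projection behind that alert is a one-liner worth automating. A sketch with illustrative numbers (the helper name and 70% default are ours, matching the threshold above):

```python
def days_until_threshold(used_gb, capacity_gb, growth_gb_per_day, threshold=0.70):
    """Days until disk usage crosses the alert threshold (None if growth is flat)."""
    alert_at = capacity_gb * threshold
    if used_gb >= alert_at:
        return 0                      # already past the alert point
    if growth_gb_per_day <= 0:
        return None                   # flat or shrinking: no projected crossing
    return (alert_at - used_gb) / growth_gb_per_day

# 1 TB volume at 550 GB, growing 5 GB/day -> 30 days to the 70% mark
runway = days_until_threshold(550, 1000, 5)  # 30.0
```

Feeding this a weekly growth rate derived from table-size history turns "the disk filled at 2 AM" into a calendar item.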

Cadence

Health check frequency matrix

Not every check needs to happen every day. Some metrics demand daily attention, others reveal their value only in monthly or quarterly trends. The key is consistency: a scheduled check that actually runs beats an ambitious checklist that gets skipped.
Daily
  Checks: Slow query log review, replication lag, connection pool usage, backup completion status, disk space utilization
  Owner: On-call DBA / SRE
  Duration: 15 min

Weekly
  Checks: Index usage analysis, lock contention trends, cache hit ratio trends, query plan regressions, storage growth rate
  Owner: DBA Team
  Duration: 1 hour

Monthly
  Checks: Full index audit, table bloat analysis, configuration drift review, user permission audit, backup restoration test
  Owner: DBA + Engineering Lead
  Duration: 3 hours

Quarterly
  Checks: Capacity planning review, schema optimization, disaster recovery drill, vendor patch assessment, hardware lifecycle check
  Owner: DBA + Architecture Team
  Duration: Half day
Engine-Specific

Health checks by database engine

Every database engine has its own instrumentation, its own failure modes, and its own tuning knobs. Generic advice gets you 60% of the way. Engine-specific knowledge covers the rest.

PostgreSQL

pg_stat_activity, pg_stat_user_tables

Strengths

Rich statistics views, pg_stat_statements for query analysis, EXPLAIN ANALYZE with buffer info

Watch For

Table bloat from MVCC, autovacuum tuning, connection limits (default 100), shared_buffers sizing

Key metric: Dead tuple ratio via pg_stat_user_tables
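As a sketch of that check: the query string below reads real `pg_stat_user_tables` columns, while the Python wrapper and the 10% threshold are illustrative assumptions, not a fixed rule.

```python
# Real columns from PostgreSQL's pg_stat_user_tables statistics view
DEAD_TUPLE_QUERY = """
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
"""

def dead_tuple_ratio(live: int, dead: int) -> float:
    """Fraction of a table's tuples that are dead (awaiting vacuum)."""
    total = live + dead
    return dead / total if total else 0.0

# A table with 900k live and 300k dead tuples is 25% bloat:
# a signal that autovacuum is not keeping up for this table
ratio = dead_tuple_ratio(900_000, 300_000)  # 0.25
vacuum_lagging = ratio > 0.10               # illustrative threshold
```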

MySQL / MariaDB

SHOW ENGINE INNODB STATUS

Strengths

Performance Schema, sys schema views, slow query log with microsecond precision

Watch For

InnoDB buffer pool size (aim for 70-80% of RAM), binary log disk usage, replication thread lag

Key metric: Innodb_buffer_pool_read_requests vs Innodb_buffer_pool_reads
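That key metric is a derived ratio. A minimal sketch of the arithmetic: the counters are the real InnoDB status variables named above, while the function itself is ours.

```python
def innodb_hit_ratio(read_requests: int, disk_reads: int) -> float:
    """Buffer pool hit ratio from the two InnoDB status counters.

    read_requests = Innodb_buffer_pool_read_requests (logical reads)
    disk_reads    = Innodb_buffer_pool_reads (reads that missed the pool)
    """
    if read_requests == 0:
        return 1.0  # no traffic yet
    return 1.0 - disk_reads / read_requests

# 10M logical reads, 50k of which went to disk -> 99.5% served from memory
ratio = innodb_hit_ratio(10_000_000, 50_000)  # 0.995
```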

Oracle

AWR Reports, ASH (Active Session History)

Strengths

Comprehensive wait event model, SQL Tuning Advisor, segment-level statistics

Watch For

SGA sizing, PGA memory pressure, undo tablespace management, archive log destination fill rate

Key metric: DB Time vs elapsed time ratio

SQL Server

sys.dm_exec_query_stats, sys.dm_os_wait_stats

Strengths

Query Store for regression detection, DMVs for real-time diagnostics, built-in intelligent tuning

Watch For

TempDB contention, memory grants and RESOURCE_SEMAPHORE waits, statistics staleness, page life expectancy

Key metric: Page Life Expectancy (target > 300 seconds)

MongoDB

db.serverStatus(), db.currentOp()

Strengths

Built-in profiler, explain() with executionStats, $indexStats aggregation

Watch For

WiredTiger cache pressure, collection scan frequency, shard key cardinality, oplog window size

Key metric: Cache dirty percentage in WiredTiger stats

Monitoring Thresholds

Alert on these thresholds, not on guesses

Good alerting is specific, actionable, and tuned to avoid fatigue. Start with these thresholds and adjust based on your workload profile. Every alert should have a documented runbook attached.
  • 200 ms: p99 query latency threshold before alerting engineering on-call
  • 80%: connection pool utilization warning before pool exhaustion cascades
  • 70%: disk utilization alert point, giving 30% headroom for growth spikes
  • < 1 s: maximum acceptable replication lag before reads become inconsistent
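The evaluation logic these thresholds imply fits in a few lines. The config shape, metric names, and `evaluate` helper below are illustrative, not any real monitoring tool's format:

```python
# Thresholds from this playbook; "gt" means alert when the value exceeds the limit
THRESHOLDS = {
    "p99_latency_ms":    ("gt", 200),
    "pool_utilization":  ("gt", 0.80),
    "disk_utilization":  ("gt", 0.70),
    "replication_lag_s": ("gt", 1.0),
}

def evaluate(metrics: dict) -> list:
    """Return the names of metrics breaching their thresholds."""
    breaches = []
    for name, (op, limit) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            continue  # missing metric: surface as a collection gap, not an alert
        if op == "gt" and value > limit:
            breaches.append(name)
    return breaches

alerts = evaluate({"p99_latency_ms": 240, "pool_utilization": 0.65,
                   "disk_utilization": 0.72, "replication_lag_s": 0.3})
# -> ["p99_latency_ms", "disk_utilization"]
```

Each name in the output should map one-to-one to a runbook page, which is what keeps alerts actionable.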
Quick Wins

Six performance tuning steps that deliver immediate results

You do not need a six-month optimization project to see improvement. These six steps address the most common performance bottlenecks we encounter in database health assessments. Most teams see measurable gains within the first week.
01

Identify Top Offenders

Sort queries by total execution time, not individual duration. A query that runs 50ms but executes 100,000 times per hour consumes more resources than a 10-second report query that runs once daily. Focus optimization where cumulative impact is highest.
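The "total time, not individual duration" rule can be sketched directly; the tuple shape and numbers below are illustrative, though pg_stat_statements exposes equivalent `calls` and `mean_exec_time` columns.

```python
def rank_by_total_time(query_stats):
    """Sort queries by cumulative execution time (calls * mean ms), descending.

    query_stats: iterable of (query_label, calls, mean_ms) tuples.
    """
    return sorted(query_stats, key=lambda q: q[1] * q[2], reverse=True)

stats = [
    ("hot 50ms lookup", 100_000, 50),      # 5,000,000 ms cumulative per hour
    ("daily 10s report",       1, 10_000), #    10,000 ms cumulative
]
worst = rank_by_total_time(stats)[0][0]    # "hot 50ms lookup"
```

The fast-but-frequent query tops the list by a factor of 500, which is the whole point of sorting this way.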

02

Fix Missing Indexes

Read execution plans, not guesses. Look for sequential scans on filtered columns, sort operations on unindexed fields, and join conditions without supporting indexes. Add composite indexes that match your most common WHERE + ORDER BY patterns.

03

Optimize Connection Pooling

Stop opening a new connection per request. Use PgBouncer, ProxySQL, or application-level pools with a maximum sized to what your database can actually serve. A PostgreSQL server with 100 max_connections does not need 500 pool connections waiting in line.

04

Implement Query Result Caching

Cache frequently read, rarely changed data at the application layer with Redis or Memcached. Set TTLs based on data freshness requirements, not arbitrary timeouts. Invalidate on writes, not on timers, for data that must be current.
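A minimal sketch of read-through caching with invalidate-on-write, using an in-process dict so it stays self-contained; in practice the store would be Redis or Memcached, and every name here is illustrative.

```python
import time

class ReadThroughCache:
    """Tiny sketch: TTL-bounded reads, explicit invalidation on writes."""

    def __init__(self, loader, ttl_seconds=60):
        self._loader = loader   # function that reads from the database
        self._ttl = ttl_seconds
        self._store = {}        # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]                 # fresh cache hit
        value = self._loader(key)           # miss or stale: hit the database
        self._store[key] = (value, time.monotonic() + self._ttl)
        return value

    def invalidate(self, key):
        """Call this on every write path so readers never see stale data."""
        self._store.pop(key, None)

db = {"user:1": "Ada"}                      # stand-in for the database
cache = ReadThroughCache(loader=db.__getitem__, ttl_seconds=60)
cache.get("user:1")                         # loads from db -> "Ada"
db["user:1"] = "Grace"
stale = cache.get("user:1")                 # "Ada": TTL not expired, write not invalidated
cache.invalidate("user:1")                  # the write path invalidates the key
fresh = cache.get("user:1")                 # reloads -> "Grace"
```

The `stale` read is the failure mode the article warns about: timer-based expiry alone leaves a staleness window that invalidate-on-write closes.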

05

Right-Size Your Buffers

Allocate 70-80% of available RAM to the database buffer pool on dedicated servers. Monitor hit ratios after changes. Oversized buffers waste memory the OS could use for filesystem cache; undersized buffers cause constant disk reads.

06

Partition Large Tables

Tables beyond 100 million rows benefit from partitioning by date, region, or tenant. Partition pruning eliminates scanning irrelevant data. Maintenance operations like VACUUM or OPTIMIZE run per-partition, reducing lock duration.

Red Flags

Warning signs that need immediate attention

If any of these conditions persist for more than 15 minutes during business hours, escalate immediately. These are not optimization opportunities — they are active risks to data integrity and service availability.

  • Query execution times exceeding 5 seconds in OLTP workloads
  • Replication lag consistently above 30 seconds
  • Connection pool utilization sustained above 90%
  • Buffer cache hit ratio dropping below 95%
  • Disk I/O queue depth exceeding 4 on SSD storage
  • Deadlock frequency increasing week over week
  • Storage growing faster than projected capacity runway
  • Backup jobs failing or skipping scheduled windows

Capacity Planning

Scale up or scale out: a decision framework

The answer is not always “add more servers.” Vertical scaling is simpler, cheaper to operate, and often sufficient. Horizontal scaling introduces distributed systems complexity that your team must be ready to manage. Choose based on the actual bottleneck, not on what sounds more impressive.

Vertical scaling fits when

  • Single-query performance is the bottleneck
  • Working set exceeds available memory
  • I/O latency is high but throughput is manageable
  • Application relies on strong consistency and joins
  • Operational complexity needs to stay low

Horizontal scaling fits when

  • Read throughput exceeds what one node can serve
  • Data volume exceeds single-node storage limits
  • Geographic distribution is required for latency
  • Write volume demands sharding across nodes
  • High availability requires multi-node redundancy
Disaster Recovery

Your backups are only as good as your last restore

A backup that has never been restored is a hope, not a plan. Disaster recovery readiness means testing the full recovery path — not just verifying that backup files exist on disk. Schedule these checks and treat failures as P1 issues.

  • Verify backup completion daily (Automated)
  • Restore a backup to staging monthly (Scheduled)
  • Measure actual RTO against SLA quarterly (Drill)
  • Test point-in-time recovery accuracy (Quarterly)
  • Validate the cross-region failover path (Semi-annual)
  • Audit backup encryption and access controls (Annual)
The Discipline

A healthy database is not an accident — it's a practice.

The organizations that never have database emergencies are not luckier than everyone else. They run the same health checks every day, every week, every quarter. They know their baseline metrics and notice when trends shift. They test their recovery procedures before they need them. Database health is not a state you achieve — it's a discipline you maintain. Start with the eight vital signs. Build the cadence. Make it boring. Boring databases are reliable databases.

Need a database health assessment?

We run comprehensive health checks across PostgreSQL, MySQL, Oracle, SQL Server, and MongoDB — identifying performance bottlenecks, capacity risks, and recovery gaps before they become incidents.