Data Engineering · 10 min read

Database health check playbook: monitor, diagnose, optimize

Most teams only look at their database when something breaks. By then the slow queries have been accumulating for weeks, the connection pool has been silently maxing out, and the storage runway is shorter than anyone realized. This playbook gives you a systematic approach to database health — so you catch problems in metrics, not in incident channels.
The Case for Proactive Checks

Reactive database management is a losing strategy

A database that “seems fine” is not a healthy database. It is an uninspected database. The difference between teams that experience routine outages and teams that run stable data infrastructure comes down to one practice: scheduled, disciplined health checks with defined thresholds and accountable owners.

  • Reactive: wait for users to report slowness before investigating queries. Proactive: automated slow query analysis flags regressions before users notice.
  • Reactive: discover the disk is full when writes start failing at midnight. Proactive: storage growth trending triggers capacity alerts weeks in advance.
  • Reactive: add indexes randomly, hoping something improves query speed. Proactive: index usage audits identify exactly which indexes to add, drop, or rebuild.
  • Reactive: assume backups work because the cron job is configured. Proactive: monthly restore drills validate that backups actually produce a working database.

Vital Signs

Eight metrics that reveal your database's true health

These are not vanity metrics. Each one maps directly to a failure mode that will eventually surface in production if left unmonitored. Track all eight, set thresholds for each, and review trends weekly.

Query Performance

Slow queries are the canary in the coal mine. Analyze execution plans, identify full table scans, and track p95/p99 query latency over time. A query that runs fine at 100 rows will cripple your system at 10 million.

p99 query latency < 200ms
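The percentile math behind that threshold is simple enough to sketch. The following is a minimal illustration, not a real monitoring stack's API: the nearest-rank method, the helper name, and the sample numbers are all ours.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # nearest-rank method
    return ordered[rank - 1]

# 100 recent query timings: mostly fast, with a slow tail
latencies = [12] * 90 + [180] * 9 + [950]
p95 = percentile(latencies, 95)   # 180 (ms)
p99 = percentile(latencies, 99)   # 180 (ms)
breach = p99 > 200                # False: under the playbook threshold
```

Note that the p99 here hides the single 950 ms outlier, which is exactly why trends over time matter more than any one snapshot.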

Connection Pool Utilization

Connection exhaustion causes cascading failures that look nothing like a database problem. Monitor active vs idle connections, track pool wait times, and set hard limits before your application starts queuing.

Pool utilization < 80%

Buffer & Cache Hit Ratios

Every cache miss is a disk read. Your buffer pool should serve 99%+ of reads from memory. A declining hit ratio signals either insufficient memory allocation or a working set that has outgrown your cache.

Cache hit ratio > 99%
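The ratio itself is simple arithmetic; here is a sketch. In PostgreSQL the inputs would come from `pg_stat_database`'s `blks_hit` and `blks_read` columns; the function name and threshold check are illustrative.

```python
def cache_hit_ratio(hits: int, disk_reads: int) -> float:
    """Fraction of block requests served from the buffer cache."""
    total = hits + disk_reads
    return hits / total if total else 1.0  # no traffic yet: treat as healthy

# 980k buffer hits vs 20k disk reads -> 98%, below the 99% target
ratio = cache_hit_ratio(980_000, 20_000)  # 0.98
needs_attention = ratio < 0.99            # True
```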

Disk I/O Latency

When the buffer pool cannot absorb the load, disk I/O becomes the bottleneck. Track read/write IOPS, average latency, and I/O queue depth. Sustained latency above 10ms on SSDs points to either saturation or hardware degradation.

Avg I/O latency < 5ms (SSD)

Replication Lag

Replication lag breaks read-after-write consistency and turns your replicas into liabilities. Monitor lag in seconds, not just bytes. A replica that falls behind during peak load is worse than having no replica at all.

Replication lag < 1 second

Lock Contention & Deadlocks

Locks are necessary; contention is not. Track lock wait times, identify long-running transactions holding row or table locks, and investigate deadlock graphs. High contention often reveals a schema or access pattern problem.

Deadlocks < 1 per hour

Index Efficiency

Unused indexes waste write performance and storage. Missing indexes force full table scans. Audit your indexes quarterly: drop what is never read, add what the query planner keeps requesting, and rebuild fragmented ones.

Index hit rate > 95%

Storage Growth Trends

Storage surprises are preventable. Track table sizes, BLOB growth, WAL/binlog accumulation, and temp tablespace usage. Project growth rates and set alerts at 70% capacity so you act before the disk fills at 2 AM.

Storage headroom > 30%
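The projection behind that alert is a one-liner worth automating. A sketch with illustrative numbers (the helper name and 70% default are ours, matching the threshold above):

```python
def days_until_threshold(used_gb, capacity_gb, growth_gb_per_day, threshold=0.70):
    """Days until disk usage crosses the alert threshold (None if growth is flat)."""
    alert_at = capacity_gb * threshold
    if used_gb >= alert_at:
        return 0                      # already past the alert point
    if growth_gb_per_day <= 0:
        return None                   # flat or shrinking: no projected crossing
    return (alert_at - used_gb) / growth_gb_per_day

# 1 TB volume at 550 GB, growing 5 GB/day -> 30 days to the 70% mark
runway = days_until_threshold(550, 1000, 5)  # 30.0
```

Feeding this a weekly growth rate derived from table-size history turns "the disk filled at 2 AM" into a calendar item.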

Cadence

Health check frequency matrix

Not every check needs to happen every day. Some metrics demand daily attention, others reveal their value only in monthly or quarterly trends. The key is consistency: a scheduled check that actually runs beats an ambitious checklist that gets skipped.
Daily
  Checks: Slow query log review, replication lag, connection pool usage, backup completion status, disk space utilization
  Owner: On-call DBA / SRE
  Duration: 15 min

Weekly
  Checks: Index usage analysis, lock contention trends, cache hit ratio trends, query plan regressions, storage growth rate
  Owner: DBA Team
  Duration: 1 hour

Monthly
  Checks: Full index audit, table bloat analysis, configuration drift review, user permission audit, backup restoration test
  Owner: DBA + Engineering Lead
  Duration: 3 hours

Quarterly
  Checks: Capacity planning review, schema optimization, disaster recovery drill, vendor patch assessment, hardware lifecycle check
  Owner: DBA + Architecture Team
  Duration: Half day
Engine-Specific

Health checks by database engine

Every database engine has its own instrumentation, its own failure modes, and its own tuning knobs. Generic advice gets you 60% of the way. Engine-specific knowledge covers the rest.

PostgreSQL

pg_stat_activity, pg_stat_user_tables

Strengths

Rich statistics views, pg_stat_statements for query analysis, EXPLAIN ANALYZE with buffer info

Watch For

Table bloat from MVCC, autovacuum tuning, connection limits (default 100), shared_buffers sizing

Key metric: Dead tuple ratio via pg_stat_user_tables
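As a sketch of that check: the query string below reads real `pg_stat_user_tables` columns, while the Python wrapper and the 10% threshold are illustrative assumptions, not a fixed rule.

```python
# Real columns from PostgreSQL's pg_stat_user_tables statistics view
DEAD_TUPLE_QUERY = """
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
"""

def dead_tuple_ratio(live: int, dead: int) -> float:
    """Fraction of a table's tuples that are dead (awaiting vacuum)."""
    total = live + dead
    return dead / total if total else 0.0

# A table with 900k live and 300k dead tuples is 25% bloat:
# a signal that autovacuum is not keeping up for this table
ratio = dead_tuple_ratio(900_000, 300_000)  # 0.25
vacuum_lagging = ratio > 0.10               # illustrative threshold
```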

MySQL / MariaDB

SHOW ENGINE INNODB STATUS

Strengths

Performance Schema, sys schema views, slow query log with microsecond precision

Watch For

InnoDB buffer pool size (aim for 70-80% of RAM), binary log disk usage, replication thread lag

Key metric: Innodb_buffer_pool_read_requests vs Innodb_buffer_pool_reads
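That key metric is a derived ratio. A minimal sketch of the arithmetic: the counters are the real InnoDB status variables named above, while the function itself is ours.

```python
def innodb_hit_ratio(read_requests: int, disk_reads: int) -> float:
    """Buffer pool hit ratio from the two InnoDB status counters.

    read_requests = Innodb_buffer_pool_read_requests (logical reads)
    disk_reads    = Innodb_buffer_pool_reads (reads that missed the pool)
    """
    if read_requests == 0:
        return 1.0  # no traffic yet
    return 1.0 - disk_reads / read_requests

# 10M logical reads, 50k of which went to disk -> 99.5% served from memory
ratio = innodb_hit_ratio(10_000_000, 50_000)  # 0.995
```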

Oracle

AWR Reports, ASH (Active Session History)

Strengths

Comprehensive wait event model, SQL Tuning Advisor, segment-level statistics

Watch For

SGA sizing, PGA memory pressure, undo tablespace management, archive log destination fill rate

Key metric: DB Time vs elapsed time ratio

SQL Server

sys.dm_exec_query_stats, sys.dm_os_wait_stats

Strengths

Query Store for regression detection, DMVs for real-time diagnostics, built-in intelligent tuning

Watch For

TempDB contention, memory grants and RESOURCE_SEMAPHORE waits, statistics staleness, page life expectancy

Key metric: Page Life Expectancy (target > 300 seconds)

MongoDB

db.serverStatus(), db.currentOp()

Strengths

Built-in profiler, explain() with executionStats, $indexStats aggregation

Watch For

WiredTiger cache pressure, collection scan frequency, shard key cardinality, oplog window size

Key metric: Cache dirty percentage in WiredTiger stats

Monitoring Thresholds

Alert on these thresholds, not on guesses

Good alerting is specific, actionable, and tuned to avoid fatigue. Start with these thresholds and adjust based on your workload profile. Every alert should have a documented runbook attached.
  • 200 ms: p99 query latency threshold before alerting engineering on-call
  • 80%: connection pool utilization warning before pool exhaustion cascades
  • 70%: disk utilization alert point, giving 30% headroom for growth spikes
  • < 1 s: maximum acceptable replication lag before reads become inconsistent
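The evaluation logic these thresholds imply fits in a few lines. The config shape, metric names, and `evaluate` helper below are illustrative, not any real monitoring tool's format:

```python
# Thresholds from this playbook; "gt" means alert when the value exceeds the limit
THRESHOLDS = {
    "p99_latency_ms":    ("gt", 200),
    "pool_utilization":  ("gt", 0.80),
    "disk_utilization":  ("gt", 0.70),
    "replication_lag_s": ("gt", 1.0),
}

def evaluate(metrics: dict) -> list:
    """Return the names of metrics breaching their thresholds."""
    breaches = []
    for name, (op, limit) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            continue  # missing metric: surface as a collection gap, not an alert
        if op == "gt" and value > limit:
            breaches.append(name)
    return breaches

alerts = evaluate({"p99_latency_ms": 240, "pool_utilization": 0.65,
                   "disk_utilization": 0.72, "replication_lag_s": 0.3})
# -> ["p99_latency_ms", "disk_utilization"]
```

Each name in the output should map one-to-one to a runbook page, which is what keeps alerts actionable.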
Quick Wins

Six performance tuning steps that deliver immediate results

You do not need a six-month optimization project to see improvement. These six steps address the most common performance bottlenecks we encounter in database health assessments. Most teams see measurable gains within the first week.
01

Identify Top Offenders

Sort queries by total execution time, not individual duration. A query that runs 50ms but executes 100,000 times per hour consumes more resources than a 10-second report query that runs once daily. Focus optimization where cumulative impact is highest.
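The "total time, not individual duration" rule can be sketched directly; the tuple shape and numbers below are illustrative, though pg_stat_statements exposes equivalent `calls` and `mean_exec_time` columns.

```python
def rank_by_total_time(query_stats):
    """Sort queries by cumulative execution time (calls * mean ms), descending.

    query_stats: iterable of (query_label, calls, mean_ms) tuples.
    """
    return sorted(query_stats, key=lambda q: q[1] * q[2], reverse=True)

stats = [
    ("hot 50ms lookup", 100_000, 50),      # 5,000,000 ms cumulative per hour
    ("daily 10s report",       1, 10_000), #    10,000 ms cumulative
]
worst = rank_by_total_time(stats)[0][0]    # "hot 50ms lookup"
```

The fast-but-frequent query tops the list by a factor of 500, which is the whole point of sorting this way.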

02

Fix Missing Indexes

Read execution plans, not guesses. Look for sequential scans on filtered columns, sort operations on unindexed fields, and join conditions without supporting indexes. Add composite indexes that match your most common WHERE + ORDER BY patterns.

03

Optimize Connection Pooling

Stop opening a new connection per request. Use PgBouncer, ProxySQL, or application-level pools with a maximum sized to what your database can actually serve. A PostgreSQL server with 100 max_connections does not need 500 pool connections waiting in line.

04

Implement Query Result Caching

Cache frequently read, rarely changed data at the application layer with Redis or Memcached. Set TTLs based on data freshness requirements, not arbitrary timeouts. Invalidate on writes, not on timers, for data that must be current.
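A minimal sketch of read-through caching with invalidate-on-write, using an in-process dict so it stays self-contained; in practice the store would be Redis or Memcached, and every name here is illustrative.

```python
import time

class ReadThroughCache:
    """Tiny sketch: TTL-bounded reads, explicit invalidation on writes."""

    def __init__(self, loader, ttl_seconds=60):
        self._loader = loader   # function that reads from the database
        self._ttl = ttl_seconds
        self._store = {}        # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]                 # fresh cache hit
        value = self._loader(key)           # miss or stale: hit the database
        self._store[key] = (value, time.monotonic() + self._ttl)
        return value

    def invalidate(self, key):
        """Call this on every write path so readers never see stale data."""
        self._store.pop(key, None)

db = {"user:1": "Ada"}                      # stand-in for the database
cache = ReadThroughCache(loader=db.__getitem__, ttl_seconds=60)
cache.get("user:1")                         # loads from db -> "Ada"
db["user:1"] = "Grace"
stale = cache.get("user:1")                 # "Ada": TTL not expired, write not invalidated
cache.invalidate("user:1")                  # the write path invalidates the key
fresh = cache.get("user:1")                 # reloads -> "Grace"
```

The `stale` read is the failure mode the article warns about: timer-based expiry alone leaves a staleness window that invalidate-on-write closes.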

05

Right-Size Your Buffers

Allocate 70-80% of available RAM to the database buffer pool on dedicated servers. Monitor hit ratios after changes. Oversized buffers waste memory the OS could use for filesystem cache; undersized buffers cause constant disk reads.

06

Partition Large Tables

Tables beyond 100 million rows benefit from partitioning by date, region, or tenant. Partition pruning eliminates scanning irrelevant data. Maintenance operations like VACUUM or OPTIMIZE run per-partition, reducing lock duration.

Red Flags

Warning signs that need immediate attention

If any of these conditions persist for more than 15 minutes during business hours, escalate immediately. These are not optimization opportunities — they are active risks to data integrity and service availability.

  • Query execution times exceeding 5 seconds in OLTP workloads
  • Replication lag consistently above 30 seconds
  • Connection pool utilization sustained above 90%
  • Buffer cache hit ratio dropping below 95%
  • Disk I/O queue depth exceeding 4 on SSD storage
  • Deadlock frequency increasing week over week
  • Storage growing faster than projected capacity runway
  • Backup jobs failing or skipping scheduled windows

Capacity Planning

Scale up or scale out: a decision framework

The answer is not always “add more servers.” Vertical scaling is simpler, cheaper to operate, and often sufficient. Horizontal scaling introduces distributed systems complexity that your team must be ready to manage. Choose based on the actual bottleneck, not on what sounds more impressive.

Vertical scaling fits when

  • Single-query performance is the bottleneck
  • Working set exceeds available memory
  • I/O latency is high but throughput is manageable
  • Application relies on strong consistency and joins
  • Operational complexity needs to stay low

Horizontal scaling fits when

  • Read throughput exceeds what one node can serve
  • Data volume exceeds single-node storage limits
  • Geographic distribution is required for latency
  • Write volume demands sharding across nodes
  • High availability requires multi-node redundancy
Disaster Recovery

Your backups are only as good as your last restore

A backup that has never been restored is a hope, not a plan. Disaster recovery readiness means testing the full recovery path — not just verifying that backup files exist on disk. Schedule these checks and treat failures as P1 issues.

  • Verify backup completion daily (Automated)
  • Restore a backup to staging monthly (Scheduled)
  • Measure actual RTO against SLA quarterly (Drill)
  • Test point-in-time recovery accuracy (Quarterly)
  • Validate the cross-region failover path (Semi-annual)
  • Audit backup encryption and access controls (Annual)
The Discipline

A healthy database is not an accident — it's a practice.

The organizations that never have database emergencies are not luckier than everyone else. They run the same health checks every day, every week, every quarter. They know their baseline metrics and notice when trends shift. They test their recovery procedures before they need them. Database health is not a state you achieve — it's a discipline you maintain. Start with the eight vital signs. Build the cadence. Make it boring. Boring databases are reliable databases.

Need a database health assessment?

We run comprehensive health checks across PostgreSQL, MySQL, Oracle, SQL Server, and MongoDB — identifying performance bottlenecks, capacity risks, and recovery gaps before they become incidents.