Database Monitoring Essentials: Prevent the Most Common Cause of Outages

Ask any SRE team what causes the most outages and they'll likely say the same thing: the database. Connection pool exhaustion, slow queries, replication lag, full disks: databases have more failure modes than almost any other component in your stack.

The good news? Most database failures are predictable and preventable with the right monitoring.

The Most Common Database Failures

Connection Pool Exhaustion

Your application has a limited number of database connections. When they're all in use, new requests queue up and eventually time out. Users see slow responses followed by errors.

Monitor: Active connections vs. max connections. Alert at 80% utilization.
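
As a rough sketch of this check, the function below takes the current active connection count and the configured maximum and reports whether the 80% threshold has been crossed. The function name and signature are illustrative assumptions; how you obtain the two numbers depends on your database or pooler.

```python
def pool_utilization_alert(active: int, max_connections: int,
                           threshold: float = 0.80) -> tuple[float, bool]:
    """Return (utilization, should_alert) for a connection pool."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    utilization = active / max_connections
    # Alert once utilization reaches the threshold (80% by default).
    return utilization, utilization >= threshold
```

For example, 85 active connections out of 100 would trip the alert, while 50 out of 100 would not.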

Slow Queries

One poorly optimized query can bring down an entire application. A query that worked fine with 10,000 rows becomes a monster with 10 million.

Monitor: P95 query execution time. Alert when it exceeds 2x your baseline.
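
To make the P95 rule concrete, here is a minimal sketch that computes a nearest-rank 95th percentile from a list of query durations and compares it against a baseline. The function names are assumptions for illustration, and most metrics systems compute percentiles for you, often with a slightly different interpolation method.

```python
def p95(durations_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of durations."""
    if not durations_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(durations_ms)
    # Integer ceiling of 0.95 * n, giving a 1-based rank without float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def slow_query_alert(durations_ms: list[float], baseline_p95_ms: float,
                     factor: float = 2.0) -> tuple[float, bool]:
    """Return (current_p95, should_alert), alerting above factor x baseline."""
    current = p95(durations_ms)
    return current, current > factor * baseline_p95_ms
```

Comparing against a multiple of your own baseline, rather than a fixed number, keeps the alert meaningful for both fast and slow workloads.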

Replication Lag

If you use read replicas, lag means users see stale data. In extreme cases, they might see data that appears to go backward in time, for example when one request is served by an up-to-date replica and the next by a lagging one.

Monitor: Seconds of replication lag. Alert above 5 seconds for most applications.
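
One way to think about the number being monitored: lag is how far the replica's last replayed change trails the primary's last commit. The sketch below assumes you can obtain both timestamps from your database; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def replication_lag_seconds(primary_commit_at: datetime,
                            replica_replayed_at: datetime) -> float:
    """Seconds by which the replica trails the primary (never negative)."""
    delta = (primary_commit_at - replica_replayed_at).total_seconds()
    return max(0.0, delta)


def lag_alert(lag_seconds: float, threshold_seconds: float = 5.0) -> bool:
    """Alert when lag is above the threshold (5 seconds by default)."""
    return lag_seconds > threshold_seconds
```

Clamping at zero guards against small clock differences between hosts making the lag look negative.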

Disk Space

Databases need disk space for data, indexes, transaction logs, and temporary files. Running out of disk space causes write failures and potential corruption.

Monitor: Disk usage percentage. Alert at 80%. Take action at 90%.
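
A minimal sketch of the two thresholds, using only the Python standard library. The status labels and function names are assumptions, and the path you check should be whichever volume holds your database files.

```python
import shutil


def classify_usage(used_fraction: float, warn_at: float = 0.80,
                   act_at: float = 0.90) -> str:
    """Map a used fraction (0.0 to 1.0) to 'ok', 'warning', or 'critical'."""
    if used_fraction >= act_at:
        return "critical"
    if used_fraction >= warn_at:
        return "warning"
    return "ok"


def disk_status(path: str = "/") -> tuple[float, str]:
    """Return (used_fraction, status) for the filesystem containing path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    used_fraction = usage.used / usage.total
    return used_fraction, classify_usage(used_fraction)
```

The gap between 80% and 90% is your reaction time, so how wide it needs to be depends on how quickly your data grows.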

Lock Contention

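When multiple transactions compete for the same rows, they end up waiting on each other's locks. Excessive lock waits slow everything down and can lead to deadlocks. Most databases detect a deadlock and abort one of the transactions, so your application sees errors and has to retry.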

Monitor: Lock wait time and deadlock frequency.
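
Deadlock counts are usually exposed as a cumulative counter, so "frequency" means the change between two samples. The sketch below assumes you sample such a counter at a known interval; the function name is hypothetical.

```python
def counter_rate_per_minute(prev_count: int, curr_count: int,
                            interval_seconds: float) -> float:
    """Events per minute between two samples of a cumulative counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_count - prev_count
    if delta < 0:
        # Counter went backward: assume it was reset (restart or stats reset)
        # and count only what has accumulated since then.
        delta = curr_count
    return delta * 60.0 / interval_seconds
```

The same calculation works for any cumulative counter, such as slow query counts or evicted keys.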

Backup Failures

Not a real-time issue, but discovering at restore time that your backups haven't been running is catastrophic.

Monitor: Backup completion via heartbeat monitoring. Alert if the daily backup heartbeat is missed.
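
Heartbeat monitoring inverts the usual check: the backup job reports in when it succeeds, and the monitor alerts when it stops hearing from it. A minimal sketch of the monitor side follows, assuming a daily schedule and a one-hour grace period (the grace period is an assumption, not a recommendation from this article).

```python
from datetime import datetime, timedelta, timezone


def heartbeat_missed(last_heartbeat: datetime, now: datetime,
                     expected_interval: timedelta = timedelta(days=1),
                     grace: timedelta = timedelta(hours=1)) -> bool:
    """True if no heartbeat arrived within the expected interval plus grace."""
    return now - last_heartbeat > expected_interval + grace
```

The grace period absorbs normal variation in backup duration so a job that runs a little long does not page anyone.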

What to Monitor by Database Type

PostgreSQL

Active connections / max_connections
Transaction rate (TPS)
Tuple operations (inserts, updates, deletes)
Cache hit ratio (should be >99%)
Replication lag (if using replicas)
Table bloat and vacuum activity
Disk usage
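
As an illustration of the cache hit ratio, PostgreSQL's pg_stat_database view exposes blks_hit (reads served from shared buffers) and blks_read (reads that had to go to disk or the OS cache). The helper below is a sketch that assumes you have already fetched those two counters; the function name is hypothetical.

```python
from typing import Optional


def cache_hit_ratio(blks_hit: int, blks_read: int) -> Optional[float]:
    """Fraction of block reads served from shared buffers, or None if idle."""
    total = blks_hit + blks_read
    if total == 0:
        return None  # no reads recorded yet, so the ratio is undefined
    return blks_hit / total
```

These counters are cumulative, so computing the ratio over the difference between two samples will likely show recent changes more clearly than the all-time figure.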

MySQL

Threads connected / max_connections
Queries per second
Slow query count
InnoDB buffer pool hit ratio
Replication lag (Seconds_Behind_Master, or Seconds_Behind_Source in MySQL 8.0.22 and later)
Binary log disk usage
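
The InnoDB buffer pool hit ratio can be derived from two status counters: Innodb_buffer_pool_read_requests (logical reads) and Innodb_buffer_pool_reads (reads that missed the pool and went to disk). The sketch below assumes you have loaded the output of SHOW GLOBAL STATUS into a dictionary; the function name is hypothetical.

```python
from typing import Optional


def innodb_buffer_pool_hit_ratio(status: dict[str, int]) -> Optional[float]:
    """Fraction of logical reads served from the buffer pool."""
    requests = status["Innodb_buffer_pool_read_requests"]  # logical reads
    disk_reads = status["Innodb_buffer_pool_reads"]        # pool misses
    if requests == 0:
        return None  # no reads yet, so the ratio is undefined
    return 1.0 - disk_reads / requests
```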

MongoDB

Current connections / max connections
Operations per second (opcounters)
Document operations (query, insert, update, delete)
Replication oplog window
WiredTiger cache usage
Lock percentage

Redis

Connected clients / maxclients
Memory usage / maxmemory
Hit rate (keyspace_hits / (keyspace_hits + keyspace_misses))
Evicted keys (should be 0 if possible)
Replication lag
RDB/AOF last save status
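
Redis reports most of these values as key:value lines in the output of the INFO command. Here is a small sketch that parses that text and derives the hit rate, assuming you have already captured the raw INFO output as a string; the function names are hypothetical.

```python
from typing import Optional


def parse_redis_info(raw: str) -> dict[str, str]:
    """Parse INFO output into a dict, skipping blank lines and '# Section' headers."""
    fields: dict[str, str] = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def redis_hit_rate(fields: dict[str, str]) -> Optional[float]:
    """keyspace_hits / (keyspace_hits + keyspace_misses), or None with no lookups."""
    hits = int(fields.get("keyspace_hits", 0))
    misses = int(fields.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else None
```

The same parsed dictionary also gives you evicted_keys, connected_clients, and rdb_last_bgsave_status for the other checks in this list.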

Setting Up Database Monitoring

Layer 1: Port Monitoring

The simplest check — is the database port accepting connections? This catches complete database crashes immediately.
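
A port check can be sketched in a few lines with the standard library. The function name and the default timeout are assumptions; a hosted monitoring service does the same thing from outside your network.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable hosts, and timeouts.
        return False
```

You might call it as port_is_open("db.internal.example", 5432), where the hostname is a placeholder and 5432 is PostgreSQL's default port. Note that an open port only proves something is listening, not that queries succeed.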

Layer 2: Query Monitoring

Run a simple query (like SELECT 1 or a count on a small table) and measure response time. This catches slow-but-alive scenarios.
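
Here is a minimal sketch of a timed probe. It should work with any Python DB-API 2.0 connection; the example uses the standard library's sqlite3 module purely as a stand-in so it runs without a database server, and the function name is hypothetical.

```python
import sqlite3
import time


def timed_probe(conn, query: str = "SELECT 1") -> tuple[bool, float]:
    """Run a probe query and return (succeeded, elapsed_seconds)."""
    start = time.perf_counter()
    try:
        cur = conn.cursor()
        cur.execute(query)
        cur.fetchone()
        ok = True
    except Exception:
        # Any failure (timeout, missing table, dropped connection) is a failed probe.
        ok = False
    return ok, time.perf_counter() - start


if __name__ == "__main__":
    # sqlite3 in-memory database used only as a placeholder connection.
    connection = sqlite3.connect(":memory:")
    print(timed_probe(connection))
```

Record the elapsed time on every run, not just failures, so you have a baseline to compare against when the database gets slow.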

Layer 3: Metric Monitoring

Collect and track database-specific metrics. Set alerts on the key indicators listed above.

Layer 4: Application-Level Monitoring

Monitor database health from your application's perspective:

Connection acquisition time
Query execution time per endpoint
Error rates by error type (timeout, connection refused, deadlock)
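
One way to collect these three signals is a small wrapper around database calls that records how long each took and what kind of error, if any, it raised. This is a sketch under assumptions: the class and function names are hypothetical, and it classifies Python's built-in exception types, whereas most drivers raise their own exception classes that you would map instead.

```python
import time
from collections import Counter
from contextlib import contextmanager


def classify_error(exc: Exception) -> str:
    """Bucket an exception into a coarse error type."""
    if isinstance(exc, TimeoutError):
        return "timeout"
    if isinstance(exc, ConnectionRefusedError):
        return "connection_refused"
    if "deadlock" in str(exc).lower():
        return "deadlock"
    return "other"


class DbMetrics:
    """Collects timings per label and error counts per error type."""

    def __init__(self) -> None:
        self.errors: Counter = Counter()
        self.timings: list[tuple[str, float]] = []

    @contextmanager
    def timed(self, label: str):
        start = time.perf_counter()
        try:
            yield
        except Exception as exc:
            self.errors[classify_error(exc)] += 1
            raise  # record the error, but let the caller handle it
        finally:
            self.timings.append((label, time.perf_counter() - start))
```

Wrapping connection acquisition with a label like "acquire" and each endpoint's queries with the endpoint name gives you both timing series from the same mechanism.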

Prevention Strategies

Connection pooling with appropriate pool sizes
Query optimization as part of your development process
Index maintenance: missing indexes cause slow queries
Capacity planning: know when you'll outgrow your current setup
Automated backups with heartbeat monitoring to confirm they run
Regular maintenance: VACUUM (PostgreSQL), OPTIMIZE TABLE (MySQL)
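
For capacity planning, even a crude projection is better than none. The sketch below assumes evenly spaced daily disk usage samples and linear growth, which is a simplification; the function name is hypothetical.

```python
from typing import Optional


def days_until_full(used_gb_samples: list[float],
                    capacity_gb: float) -> Optional[float]:
    """Linear projection from daily usage samples (oldest first)."""
    if len(used_gb_samples) < 2:
        raise ValueError("need at least two samples")
    # Average daily growth between the first and last sample.
    growth_per_day = (used_gb_samples[-1] - used_gb_samples[0]) / (len(used_gb_samples) - 1)
    if growth_per_day <= 0:
        return None  # usage is flat or shrinking, so no projected fill date
    return (capacity_gb - used_gb_samples[-1]) / growth_per_day
```

With usage growing from 100 GB to 106 GB over four daily samples on a 200 GB volume, this projects 47 days of headroom.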

The Simple Starting Point

If you do nothing else:

Port monitor on your database (catches crashes)
Query response time monitor (catches performance issues)
Disk space alert at 80% (prevents the most preventable disaster)
Backup heartbeat (confirms your safety net exists)

Four monitors. Fifteen minutes of setup. Protection against your most likely outage scenarios.
