Database Monitoring Essentials: Prevent the Most Common Cause of Outages
Database issues cause more application outages than anything else. Connection pool exhaustion, slow queries, replication lag — here's how to catch them early.
Ask any SRE team what causes the most outages and they'll likely say the same thing: the database. Connection pool exhaustion, slow queries, replication lag, disk space — databases have more failure modes than any other component in your stack.
The good news? Most database failures are predictable and preventable with the right monitoring.
The Most Common Database Failures
Connection Pool Exhaustion
Your application has a limited number of database connections. When they're all in use, new requests queue up and eventually time out. Users see slow responses followed by errors.
Monitor: Active connections vs. max connections. Alert at 80% utilization.
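The 80% rule is simple arithmetic. Here is a minimal sketch in Python, assuming you already collect the active connection count and the configured limit from your database or pooler; the function names are hypothetical, not part of any library.

```python
def pool_utilization(active: int, max_connections: int) -> float:
    """Return the fraction of the connection limit currently in use."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return active / max_connections


def should_alert(active: int, max_connections: int, threshold: float = 0.80) -> bool:
    """True when utilization is at or above the alert threshold (80% by default)."""
    return pool_utilization(active, max_connections) >= threshold


if __name__ == "__main__":
    # Example numbers only: 85 of 100 connections in use.
    print(should_alert(85, 100))
```

Alerting on the ratio rather than the raw count means the same rule keeps working after you raise the connection limit.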
Slow Queries
One poorly optimized query can bring down an entire application. A query that worked fine with 10,000 rows becomes a monster with 10 million.
Monitor: Query execution time P95. Alert when it exceeds 2x your baseline.
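To make the P95 rule concrete, here is a small Python sketch that uses the nearest-rank percentile method. It assumes you have a list of recent query durations and a baseline P95 you measured earlier; both inputs and the function names are illustrative.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of durations."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # ceil(0.95 * n) in integer arithmetic, as a 1-based rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def slow_query_alert(samples, baseline_p95, factor=2.0):
    """True when the current P95 exceeds `factor` times the baseline."""
    return p95(samples) > factor * baseline_p95
```

A percentile is used instead of the average so that one or two extreme outliers do not trigger the alert, while a broad slowdown still does.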
Replication Lag
If you use read replicas, lag means users see stale data. In extreme cases, they might see data that appears to go backward in time.
Monitor: Seconds of replication lag. Alert above 5 seconds for most applications.
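One way to get a lag figure is to compare the current time with the timestamp of the last transaction the replica has applied. On PostgreSQL, for example, a replica reports that timestamp through pg_last_xact_replay_timestamp(). The sketch below assumes you have already fetched that timestamp; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def replication_lag_seconds(last_replay: datetime, now: datetime) -> float:
    """Seconds between the last replayed transaction and now (never negative)."""
    return max(0.0, (now - last_replay).total_seconds())


def lag_alert(last_replay: datetime, now: datetime, threshold_seconds: float = 5.0) -> bool:
    """True when lag is above the threshold (5 seconds by default)."""
    return replication_lag_seconds(last_replay, now) > threshold_seconds


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(lag_alert(now - timedelta(seconds=7), now))
```

Note that a timestamp-based figure also grows when the primary is simply idle, so treat it as a signal to investigate rather than proof of a problem.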
Disk Space
Databases need disk space for data, indexes, transaction logs, and temporary files. Running out of disk space causes write failures and potential corruption.
Monitor: Disk usage percentage. Alert at 80%. Take action at 90%.
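The two thresholds can be checked with Python's standard library alone. This is a rough sketch that assumes the script runs on the database host and that you pass the path of the volume holding the data directory; the level names are made up for illustration.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def classify_disk(percent: float, warn: float = 80.0, critical: float = 90.0) -> str:
    """Map a usage percentage to 'ok', 'warning' (alert) or 'critical' (act now)."""
    if percent >= critical:
        return "critical"
    if percent >= warn:
        return "warning"
    return "ok"


if __name__ == "__main__":
    pct = disk_usage_percent("/")
    print(f"{pct:.1f}% used: {classify_disk(pct)}")
```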
Lock Contention
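Transactions that compete for the same rows have to wait on each other's locks. Excessive lock waits slow everything down. Most databases detect deadlocks and abort one of the transactions involved, so a deadlock typically surfaces as a failed transaction that the application has to retry.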
Monitor: Lock wait time and deadlock frequency.
Backup Failures
Not a real-time issue, but discovering your backups haven't been running when you need to restore is catastrophic.
Monitor: Backup completion via heartbeat monitoring. Alert if the daily backup heartbeat is missed.
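The logic behind a heartbeat check is small: the backup job pings a monitor when it finishes, and the monitor alerts if no ping arrives within the expected interval. Here is a sketch of that check. The one-hour grace period is an assumption added for illustration, to allow for backups that run slightly long.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_missed(
    last_ping: datetime,
    now: datetime,
    interval: timedelta = timedelta(days=1),
    grace: timedelta = timedelta(hours=1),
) -> bool:
    """True when more than interval + grace has passed since the last ping."""
    return now - last_ping > interval + grace


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # A backup that last reported 26 hours ago counts as missed.
    print(heartbeat_missed(now - timedelta(hours=26), now))
```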
What to Monitor by Database Type
PostgreSQL
- Active connections / max_connections
- Transaction rate (TPS)
- Tuple operations (inserts, updates, deletes)
- Cache hit ratio (should be >99%)
- Replication lag (if using replicas)
- Table bloat and vacuum activity
- Disk usage
MySQL
- Threads connected / max_connections
- Queries per second
- Slow query count
- InnoDB buffer pool hit ratio
- Replication lag (Seconds_Behind_Master, or Seconds_Behind_Source in MySQL 8.0.22 and later)
- Binary log disk usage
MongoDB
- Current connections / max connections
- Operations per second (opcounters)
- Document operations (query, insert, update, delete)
- Replication oplog window
- WiredTiger cache usage
- Lock percentage
Redis
- Connected clients / maxclients
- Memory usage / maxmemory
- Hit rate (keyspace_hits / (keyspace_hits + keyspace_misses))
- Evicted keys (should be 0 if possible)
- Replication lag
- RDB/AOF last save status
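As an example of turning raw counters into one of these metrics, the sketch below computes the Redis hit rate from the text that the INFO command returns (lines of the form key:value, with section headers starting with #). It assumes you have already fetched that text with your client of choice; the helper names are hypothetical.

```python
def parse_info(text: str) -> dict:
    """Parse Redis INFO output into a dict of string keys and string values."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Stats".
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        stats[key] = value
    return stats


def hit_rate(stats: dict):
    """keyspace_hits / (keyspace_hits + keyspace_misses), or None with no lookups."""
    hits = int(stats.get("keyspace_hits", 0))
    misses = int(stats.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else None


if __name__ == "__main__":
    sample = "# Stats\r\nkeyspace_hits:900\r\nkeyspace_misses:100\r\nevicted_keys:0\r\n"
    print(hit_rate(parse_info(sample)))
```

These counters are cumulative since the server started, so for alerting you would usually compare two readings taken some time apart rather than use the lifetime ratio.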
Setting Up Database Monitoring
Layer 1: Port Monitoring
The simplest check — is the database port accepting connections? This catches complete database crashes immediately.
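A port check needs nothing more than a TCP connection attempt. Here is a minimal sketch using Python's socket module; the host, port and timeout in the usage example are placeholders.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


if __name__ == "__main__":
    # 5432 is the default PostgreSQL port; replace with your own host and port.
    print(port_is_open("127.0.0.1", 5432))
```

A successful connect only proves that something is listening. It says nothing about whether queries work, which is what the next layer covers.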
Layer 2: Query Monitoring
Run a simple query (like SELECT 1 or a count on a small table) and measure response time. This catches slow-but-alive scenarios.
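Such a probe can be sketched against any Python DB-API connection. The example below uses an in-memory SQLite database purely as a stand-in so that it runs anywhere; in practice you would pass a connection from your real driver. The function name is hypothetical.

```python
import sqlite3
import time


def timed_probe(conn, query: str = "SELECT 1"):
    """Run a cheap query and return (ok, elapsed_seconds)."""
    start = time.perf_counter()
    try:
        cur = conn.cursor()
        cur.execute(query)
        cur.fetchone()
        ok = True
    except Exception:
        # Any driver error counts as a failed probe.
        ok = False
    return ok, time.perf_counter() - start


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for a real database connection
    ok, elapsed = timed_probe(conn)
    print(ok, f"{elapsed:.6f}s")
```

Recording the elapsed time on every run, not only on failure, gives you the baseline you need to spot a database that is alive but slowing down.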
Layer 3: Metric Monitoring
Collect and track database-specific metrics. Set alerts on the key indicators listed above.
Layer 4: Application-Level Monitoring
Monitor database health from your application's perspective:
- Connection acquisition time
- Query execution time per endpoint
- Error rates by error type (timeout, connection refused, deadlock)
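Inside the application, these measurements can be gathered with a small wrapper around database calls. The sketch below keeps timings and error counts in memory and groups errors by exception class name. Both choices are assumptions made for illustration: a real setup would send the numbers to your metrics system, and the exception classes depend on your driver.

```python
import time
from collections import Counter, defaultdict
from contextlib import contextmanager

durations = defaultdict(list)  # metric name -> list of elapsed seconds
errors = Counter()             # exception class name -> number of occurrences


@contextmanager
def observe(metric: str):
    """Time the wrapped block and count any exception it raises."""
    start = time.perf_counter()
    try:
        yield
    except Exception as exc:
        errors[type(exc).__name__] += 1
        raise  # the caller still sees the original error
    finally:
        durations[metric].append(time.perf_counter() - start)


if __name__ == "__main__":
    with observe("connection_acquire"):
        time.sleep(0.01)  # stand-in for acquiring a pooled connection
    print(len(durations["connection_acquire"]), dict(errors))
```

Using one metric name per endpoint, for example "query:/checkout", gives the per-endpoint breakdown listed above.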
Prevention Strategies
- Connection pooling with appropriate pool sizes
- Query optimization as part of your development process
- Index maintenance — missing indexes cause slow queries
- Capacity planning — know when you'll outgrow your current setup
- Automated backups with heartbeat monitoring to confirm they run
- Regular maintenance: VACUUM (PostgreSQL), OPTIMIZE TABLE (MySQL)
The Simple Starting Point
If you do nothing else:
- Port monitor on your database (catches crashes)
- Query response time monitor (catches performance issues)
- Disk space alert at 80% (prevents the most preventable disaster)
- Backup heartbeat (confirms your safety net exists)
Four monitors. Fifteen minutes of setup. The prevention of your most likely outage scenario.
Written by
UptimeGuard Team