
Database Monitoring Essentials: Prevent the Most Common Cause of Outages

Database issues cause more application outages than anything else. Connection pool exhaustion, slow queries, replication lag — here's how to catch them early.

UptimeGuard Team
January 22, 2026 · 10 min read · 5,680 views
database · postgresql · mysql · mongodb · redis · monitoring

Ask any SRE team what causes the most outages and they'll likely say the same thing: the database. Connection pool exhaustion, slow queries, replication lag, disk space — databases have more failure modes than any other component in your stack.

The good news? Most database failures are predictable and preventable with the right monitoring.

The Most Common Database Failures

Connection Pool Exhaustion

Your application has a limited number of database connections. When they're all in use, new requests queue up and eventually time out. Users see slow responses followed by errors.

Monitor: Active connections vs. max connections. Alert at 80% utilization.
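
That 80% rule is just a ratio check. A minimal sketch, assuming you have already fetched the two counts (in PostgreSQL, active connections come from pg_stat_activity and the limit from SHOW max_connections):

```python
def connection_alert(active: int, max_connections: int, threshold: float = 0.80) -> bool:
    """Return True when pool utilization reaches the alert threshold."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return active / max_connections >= threshold
```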

Slow Queries

One poorly optimized query can bring down an entire application. A query that worked fine with 10,000 rows becomes a monster with 10 million.

Monitor: Query execution time P95. Alert when it exceeds 2x your baseline.
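
A minimal sketch of this alert, using a nearest-rank P95 over recent samples (samples and baseline_p95_ms are assumed inputs from your metrics pipeline):

```python
import math

def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of query times (ms)."""
    ordered = sorted(samples)
    return ordered[max(0, math.ceil(0.95 * len(ordered)) - 1)]

def slow_query_alert(samples: list[float], baseline_p95_ms: float,
                     factor: float = 2.0) -> bool:
    """Alert when the current P95 exceeds factor x the recorded baseline."""
    return p95(samples) > factor * baseline_p95_ms
```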

Replication Lag

If you use read replicas, lag means users see stale data. In extreme cases, they might see data that appears to go backward in time.

Monitor: Seconds of replication lag. Alert above 5 seconds for most applications.
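
On PostgreSQL, lag in seconds can be read on the replica with pg_last_xact_replay_timestamp(). A hedged sketch of the query plus the alert rule (the 5-second threshold matches the suggestion above):

```python
# Run on the replica. COALESCE covers the case where no transaction
# has been replayed yet and the function returns NULL.
REPLICA_LAG_SQL = """
SELECT COALESCE(EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())), 0)
"""

def replication_lag_alert(lag_seconds: float, threshold: float = 5.0) -> bool:
    """Alert when replication lag exceeds the threshold (in seconds)."""
    return lag_seconds > threshold
```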

Disk Space

Databases need disk space for data, indexes, transaction logs, and temporary files. Running out of disk space causes write failures and potential corruption.

Monitor: Disk usage percentage. Alert at 80%. Take action at 90%.
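
Those thresholds map directly onto the standard library. A sketch that classifies the volume holding your data directory (the path is an assumption; point it at your actual mount):

```python
import shutil

def disk_status(path: str = "/") -> tuple[str, float]:
    """Return ('ok' | 'alert' | 'critical', percent_used) for the volume at path."""
    usage = shutil.disk_usage(path)
    pct = usage.used / usage.total * 100
    if pct >= 90:
        return "critical", pct  # take action now
    if pct >= 80:
        return "alert", pct     # page someone
    return "ok", pct
```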

Lock Contention

Multiple transactions competing for the same rows create locks. Excessive locking slows everything down and can cause deadlocks that require manual intervention.

Monitor: Lock wait time and deadlock frequency.

Backup Failures

Not a real-time issue, but discovering your backups haven't been running when you need to restore is catastrophic.

Monitor: Backup completion via heartbeat monitoring. Alert if the daily backup heartbeat is missed.
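
The heartbeat check itself is a timestamp comparison. This sketch assumes you record the completion time of each backup run (the one-hour grace period is an assumption):

```python
from datetime import datetime, timedelta, timezone

def heartbeat_missed(last_ping: datetime,
                     expected_interval: timedelta = timedelta(hours=24),
                     grace: timedelta = timedelta(hours=1)) -> bool:
    """True when the backup heartbeat is overdue past its grace period."""
    return datetime.now(timezone.utc) - last_ping > expected_interval + grace
```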

What to Monitor by Database Type

PostgreSQL

  • Active connections / max_connections
  • Transaction rate (TPS)
  • Tuple operations (inserts, updates, deletes)
  • Cache hit ratio (should be >99%)
  • Replication lag (if using replicas)
  • Table bloat and vacuum activity
  • Disk usage
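
The cache hit ratio, for example, can be computed from the pg_stat_database counters. A sketch of the SQL and the arithmetic (compare the result against the >99% target):

```python
# blks_hit = reads served from shared buffers; blks_read = reads that hit disk.
CACHE_HIT_SQL = "SELECT sum(blks_hit), sum(blks_read) FROM pg_stat_database"

def cache_hit_ratio(blks_hit: int, blks_read: int) -> float:
    """Fraction of block requests served from shared buffers."""
    total = blks_hit + blks_read
    return blks_hit / total if total else 1.0
```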

MySQL

  • Threads connected / max_connections
  • Queries per second
  • Slow query count
  • InnoDB buffer pool hit ratio
  • Replication lag (Seconds_Behind_Master)
  • Binary log disk usage

MongoDB

  • Current connections / max connections
  • Operations per second (opcounters)
  • Document operations (query, insert, update, delete)
  • Replication oplog window
  • WiredTiger cache usage
  • Lock percentage

Redis

  • Connected clients / maxclients
  • Memory usage / maxmemory
  • Hit rate (keyspace_hits / (keyspace_hits + keyspace_misses))
  • Evicted keys (should be 0 if possible)
  • Replication lag
  • RDB/AOF last save status
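
The hit rate comes from the keyspace_hits and keyspace_misses counters in the stats section of Redis's INFO output. A sketch of the arithmetic:

```python
def redis_hit_rate(keyspace_hits: int, keyspace_misses: int) -> float:
    """Fraction of key lookups that found the key in the cache."""
    total = keyspace_hits + keyspace_misses
    return keyspace_hits / total if total else 1.0
```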

Setting Up Database Monitoring

Layer 1: Port Monitoring

The simplest check — is the database port accepting connections? This catches complete database crashes immediately.
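
A minimal sketch of this check using only the standard library (host, port, and timeout are placeholders for your own values):

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Layer 1: can we open a TCP connection to the database port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```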

Layer 2: Query Monitoring

Run a simple query (like SELECT 1 or a count on a small table) and measure response time. This catches slow-but-alive scenarios.
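
Illustrated here with SQLite's in-memory database so the snippet runs anywhere; the same pattern applies to any DB-API driver (swap sqlite3.connect for your real connection):

```python
import sqlite3
import time

def timed_health_query(conn) -> float:
    """Run SELECT 1 and return elapsed time in milliseconds."""
    start = time.perf_counter()
    conn.execute("SELECT 1").fetchone()
    return (time.perf_counter() - start) * 1000

conn = sqlite3.connect(":memory:")
elapsed_ms = timed_health_query(conn)
```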

Layer 3: Metric Monitoring

Collect and track database-specific metrics. Set alerts on the key indicators listed above.

Layer 4: Application-Level Monitoring

Monitor database health from your application's perspective:

  • Connection acquisition time
  • Query execution time per endpoint
  • Error rates by error type (timeout, connection refused, deadlock)
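
Error-type bucketing can be as simple as matching on the exception message. This sketch's categories mirror the list above (the matching strings are assumptions; adjust them to your driver's actual messages):

```python
from collections import Counter

def classify_db_error(exc: Exception) -> str:
    """Bucket a database exception into one of the error types above."""
    message = str(exc).lower()
    if "timeout" in message:
        return "timeout"
    if "refused" in message:
        return "connection_refused"
    if "deadlock" in message:
        return "deadlock"
    return "other"

error_counts = Counter()
for exc in (TimeoutError("query timeout"), ConnectionError("connection refused")):
    error_counts[classify_db_error(exc)] += 1
```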

Prevention Strategies

  1. Connection pooling with appropriate pool sizes
  2. Query optimization as part of your development process
  3. Index maintenance — missing indexes cause slow queries
  4. Capacity planning — know when you'll outgrow your current setup
  5. Automated backups with heartbeat monitoring to confirm they run
  6. Regular maintenance — vacuum (PostgreSQL), optimize (MySQL)

The Simple Starting Point

If you do nothing else:

  1. Port monitor on your database (catches crashes)
  2. Query response time monitor (catches performance issues)
  3. Disk space alert at 80% (prevents the most preventable disaster)
  4. Backup heartbeat (confirms your safety net exists)

Four monitors. Fifteen minutes of setup. Protection against your most likely outage scenario.
