Monitoring Redis: Prevent Cache Failures That Cascade Into Outages
When Redis goes down, your database gets hammered, response times spike, and your entire application crumbles. Here's how to monitor Redis before it takes everything down.
Redis is often the silent hero of your architecture. It caches database queries, stores sessions, manages rate limits, powers real-time features, and handles message queues. It works so well that teams forget it's there.
Until it fails. And when Redis fails, everything behind it fails too.
The Redis Cascade Effect
Here's what typically happens when Redis goes down:
- Cache misses spike — Every request that normally hits Redis now hits your database
- Database overloads — It's suddenly handling 10-100x more queries than normal
- Response times spike — 50ms responses become 5-second responses
- Connection pools exhaust — Database connections run out
- Application errors — Services start returning 500 errors
- Total outage — The application becomes unusable
The entire cascade can unfold in under 60 seconds.
What to Monitor
Connection Health
- Port 6379 availability — Basic TCP check, catches complete crashes
- Connected clients vs. maxclients — Alert at 80% to prevent connection refusal
- Rejected connections — Any rejected connection is a problem
- Connection rate — Sudden spikes might indicate a connection leak
Memory
- Used memory vs. maxmemory — Alert at 85%
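- Memory fragmentation ratio — Should be close to 1.0; values well above 1.5 mean memory is being lost to fragmentation, and values below 1.0 suggest Redis memory is being swapped to disk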
- Evicted keys — If keys are being evicted, you're at capacity
- Expired keys — Normal, but sudden spikes might indicate unusual patterns
Performance
- Command latency — P50, P95, P99 latency for key operations
- Operations per second — Baseline and alerting on anomalies
- Hit rate — keyspace_hits / (keyspace_hits + keyspace_misses). Below 90% usually means problems
- Slow log entries — Commands taking longer than the slow-log threshold
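As a rough illustration of the hit rate formula above, here is a minimal Python sketch. The function names and the 0.90 default threshold are illustrative; the two inputs are the keyspace_hits and keyspace_misses counters from INFO.

```python
from typing import Optional


def hit_rate(keyspace_hits: int, keyspace_misses: int) -> Optional[float]:
    """Return hits / (hits + misses), or None when there were no lookups."""
    total = keyspace_hits + keyspace_misses
    if total == 0:
        # A freshly started instance has no lookups yet; avoid dividing by zero.
        return None
    return keyspace_hits / total


def hit_rate_is_low(keyspace_hits: int, keyspace_misses: int,
                    threshold: float = 0.90) -> bool:
    """True when a hit rate exists and falls below the threshold."""
    rate = hit_rate(keyspace_hits, keyspace_misses)
    return rate is not None and rate < threshold
```

Both counters are cumulative since the server started, so a long-running instance can hide a recent drop. Sampling the counters periodically and computing the rate from the difference between two samples gives a more current picture.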
Persistence
- Last RDB save status — Was the last snapshot successful?
- Last RDB save time — How long since the last successful save?
- AOF status — Is the append-only file being written?
- AOF rewrite status — Is the background rewrite completing?
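The persistence checks above map onto a handful of INFO fields. The sketch below assumes the INFO output has already been parsed into a dictionary of strings; the function name and the one-hour snapshot age limit are illustrative choices, not recommendations from Redis.

```python
import time
from typing import Dict, List, Optional


def persistence_problems(info: Dict[str, str],
                         max_snapshot_age_s: int = 3600,
                         now: Optional[float] = None) -> List[str]:
    """Return a list of persistence problems found in parsed INFO fields."""
    now = time.time() if now is None else now
    problems = []

    # Was the last snapshot successful?
    if info.get("rdb_last_bgsave_status", "ok") != "ok":
        problems.append("last RDB snapshot failed")

    # How long since the last successful save? (Unix timestamp in seconds.)
    last_save = int(info.get("rdb_last_save_time", "0"))
    if last_save and now - last_save > max_snapshot_age_s:
        problems.append("last RDB snapshot is too old")

    # AOF fields only matter when the append-only file is enabled.
    if info.get("aof_enabled") == "1":
        if info.get("aof_last_write_status", "ok") != "ok":
            problems.append("last AOF write failed")
        if info.get("aof_last_bgrewrite_status", "ok") != "ok":
            problems.append("last AOF rewrite failed")

    return problems
```

An empty list means none of these checks fired. The age limit should be tuned to match your own save schedule.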
Replication (If Using Replicas)
- Connected replicas — Are all replicas connected? (Reported as connected_slaves in INFO)
- Replication lag — How far behind are replicas?
- Master link status — Is the replica connected to the master?
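One way to estimate replication lag is to compare offsets in the replication section of INFO on the master. The sketch below assumes the usual layout of that section: a master_repl_offset line plus one slaveN line per replica containing comma-separated key=value pairs. The function name is illustrative.

```python
from typing import Dict


def replica_lag_bytes(info_text: str) -> Dict[str, int]:
    """Map each replica (slave0, slave1, ...) to how many bytes it is behind."""
    master_offset = 0
    replica_offsets: Dict[str, int] = {}

    for line in info_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # section headers and blank lines have no "key:value" form
        if key == "master_repl_offset":
            master_offset = int(value)
        elif key.startswith("slave") and key[5:].isdigit():
            # Example value: ip=10.0.0.2,port=6379,state=online,offset=1000,lag=0
            fields = dict(pair.split("=", 1) for pair in value.split(","))
            replica_offsets[key] = int(fields["offset"])

    return {name: master_offset - offset
            for name, offset in replica_offsets.items()}
```

A replica that stays many bytes behind, or whose gap keeps growing between samples, is worth an alert. What counts as "many" depends on your write volume.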
Setting Up Redis Monitoring
Layer 1: External Port Check
Monitor port 6379 (or your custom port) every 30 seconds. This catches complete Redis crashes immediately.
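If you want to see what a basic port check does under the hood, here is a minimal sketch using only Python's standard library. The function name and the two-second timeout are illustrative; a monitoring service would run something similar on a schedule.

```python
import socket


def port_is_open(host: str, port: int = 6379, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection performs the TCP handshake; the socket is closed on exit.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A successful TCP connection only proves that something is listening on the port. It does not prove that Redis can authenticate clients or answer commands.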
Layer 2: Application-Level Check
Create a health endpoint in your application that:
- Writes a key to Redis (SET health_check timestamp)
- Reads it back (GET health_check)
- Returns the round-trip time
Monitor this endpoint. It catches connectivity issues, authentication problems, and performance degradation.
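The three steps above can be sketched as a single function that your health endpoint calls. This version assumes a client object with set(key, value, ex=...) and get(key) methods, which is the shape of redis-py's client; the function name, the result keys, and the 60-second expiry are illustrative.

```python
import time
from typing import Any, Dict


def redis_health_check(client: Any, key: str = "health_check") -> Dict[str, Any]:
    """Write a timestamp to Redis, read it back, and report the round-trip time."""
    expected = str(time.time())
    started = time.perf_counter()
    try:
        client.set(key, expected, ex=60)  # expire so the key never lingers
        value = client.get(key)
    except Exception as exc:
        # Connection, authentication, and timeout errors all land here.
        return {"ok": False, "round_trip_ms": None, "error": str(exc)}

    elapsed_ms = (time.perf_counter() - started) * 1000
    if isinstance(value, bytes):
        value = value.decode()  # clients may return bytes rather than str
    return {"ok": value == expected, "round_trip_ms": elapsed_ms, "error": None}
```

If several application instances run this check against the same key at the same moment, one may read back another's timestamp and report a false failure. Adding a per-instance suffix to the key avoids that.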
Layer 3: Redis Metrics
Collect Redis INFO output and track key metrics. Most monitoring tools can parse Redis INFO directly.
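If your tooling does not parse INFO for you, the format is simple enough to handle yourself: section headers start with #, and every metric is a key:value line. A minimal sketch follows (the function name is illustrative):

```python
from typing import Dict


def parse_info(text: str) -> Dict[str, str]:
    """Parse raw INFO output into a flat dictionary of string values."""
    metrics: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as "# Memory"
        key, sep, value = line.partition(":")
        if sep:
            metrics[key] = value
    return metrics
```

Values are left as strings on purpose, because INFO mixes integers, floats, and text. Convert only the fields you alert on, for example int(metrics["used_memory"]).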
Common Redis Failure Scenarios
Memory Exhaustion
Redis has a maxmemory setting. When that limit is reached, behavior depends on the eviction policy (maxmemory-policy). Two common choices:
- noeviction: Returns errors on commands that would use more memory (no silent data loss, but writes break)
- allkeys-lru: Evicts least recently used keys (data loss, but stays functional)
Monitor memory usage and alert well before maxmemory.
Slow Commands Blocking Everything
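Redis executes commands on a single thread. One slow command (like KEYS * on a large dataset) blocks all other commands until it finishes. Monitor for slow log entries and alert on any command taking longer than 100ms. Note that the slow log records commands above its own threshold (slowlog-log-slower-than, 10ms by default), so keep that threshold at or below your alert level.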
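As a sketch of the alerting side, the function below filters slow log entries by duration. It assumes each entry is a dictionary with a "duration" in microseconds (the unit the slow log reports) and a "command" field. That dictionary shape is an assumption modeled on what a client library such as redis-py returns, so check your own client's output.

```python
from typing import Any, Dict, Iterable, List


def slow_commands(entries: Iterable[Dict[str, Any]],
                  threshold_ms: float = 100.0) -> List[Dict[str, Any]]:
    """Return slow log entries whose execution time exceeds threshold_ms."""
    threshold_us = threshold_ms * 1000  # slow log durations are in microseconds
    return [entry for entry in entries if entry["duration"] > threshold_us]
```

The slow log is a fixed-size buffer, so poll it often enough that entries are not overwritten before you read them.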
Persistence Fork Failure
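Redis forks the process for RDB snapshots and AOF rewrites. On systems with limited memory, the fork can fail and persistence stops. With the default stop-writes-on-bgsave-error yes setting, Redis also starts rejecting writes after a failed snapshot. If that setting is disabled, the failure shows up only in the logs and in INFO fields such as rdb_last_bgsave_status, which is why those fields need a monitor.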
Network Partition
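If Redis becomes unreachable due to a network issue, your application might continue running but with degraded performance (falling back to the database for every request). From the application's point of view this looks the same as a Redis crash, so the cascade described above still applies.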
The Minimum Redis Monitoring Setup
- Port monitor on Redis port (30-second interval)
- Application health check that reads/writes Redis (60-second interval)
- Memory usage alert at 85% of maxmemory
- Connected clients alert at 80% of maxclients
- Hit rate alert below 90%
Five monitors that guard against your most likely cascade failure.
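To make the thresholds concrete, here is a hedged sketch that turns the five checks into alert messages. The function name, parameter names, and message strings are illustrative; the inputs are assumed to come from your port monitor, your health endpoint, INFO, and your Redis configuration.

```python
from typing import List, Optional


def evaluate_minimum_monitors(port_open: bool,
                              health_ok: bool,
                              used_memory: int,
                              maxmemory: int,
                              connected_clients: int,
                              maxclients: int,
                              hit_rate: Optional[float]) -> List[str]:
    """Return one alert message per failing check from the minimum setup."""
    alerts = []
    if not port_open:
        alerts.append("Redis port unreachable")
    if not health_ok:
        alerts.append("read/write health check failed")
    # maxmemory of 0 means no limit is configured, so there is no ratio to check.
    if maxmemory > 0 and used_memory / maxmemory >= 0.85:
        alerts.append("memory usage at or above 85% of maxmemory")
    if maxclients > 0 and connected_clients / maxclients >= 0.80:
        alerts.append("connected clients at or above 80% of maxclients")
    # hit_rate is None when there have been no lookups yet.
    if hit_rate is not None and hit_rate < 0.90:
        alerts.append("hit rate below 90%")
    return alerts
```

An empty list means all five checks passed. In practice you would also require a condition to hold for a few consecutive samples before paging anyone, so that a single blip does not wake someone up.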
Written by
UptimeGuard Team