Monitoring Docker Containers: What Breaks and How to Catch It
Containers crash, restart, run out of memory, and fail health checks — all while your orchestrator tries to hide the problem. Here's how to maintain visibility.
Containers are supposed to make operations simpler. And they do — until they don't. The abstraction that makes containers powerful also makes them harder to monitor. Problems that would be obvious on a bare-metal server can be invisible inside a container.
How Containers Fail Differently
Silent Restarts
When a container crashes, your orchestrator (Docker Compose, Kubernetes, ECS) automatically restarts it. From the outside, the service might appear healthy — it's running, right? But each restart means a brief outage, lost in-flight requests, and potentially lost state.
OOMKills
Containers run with memory limits (when you set them). When a container exceeds its limit, the kernel kills the process: no graceful shutdown, no error log from the application, no warning. The orchestrator restarts it and the cycle continues.
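An OOMKill does leave fingerprints, even though the process never gets a chance to log: Docker records it in the container's state, and the exit code is typically 137 (128 + SIGKILL). Assuming a container named `my-app`:

```shell
# "true" if the kernel OOM-killed the container's main process
docker inspect --format '{{.State.OOMKilled}}' my-app

# Exit code 137 (128 + SIGKILL) is another strong hint
docker inspect --format '{{.State.ExitCode}}' my-app
```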
Zombie Containers
The container is running. The process inside is alive. But it's not actually doing anything useful — stuck in a deadlock, waiting on a resource that will never arrive, or caught in an infinite loop that consumes CPU but produces no results.
Image Pull Failures
A new deployment requires pulling an updated image. If the registry is down or the image tag doesn't exist, the container can't start. The old container might keep running (good) or might have been stopped first (bad).
Networking Issues
Containers communicate over virtual networks. DNS resolution between containers can fail. Network policies can block traffic. Load balancer health checks might pass even when the application is broken.
What to Monitor
Container-Level Metrics
- Restart count — A container that keeps restarting has a problem, even if it's "running"
- Memory usage vs. limit — How close are you to an OOMKill?
- CPU usage — Sustained high CPU might indicate a loop or inefficiency
- Network I/O — Sudden drops might indicate connectivity issues
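The raw numbers come from `docker stats` (or the metrics API on Kubernetes); the alerting logic on top of them is simple. A minimal sketch of the memory-vs-limit check, with illustrative container names and readings:

```python
def memory_pressure(usage_bytes: int, limit_bytes: int, threshold: float = 0.85) -> bool:
    """True when a container is close enough to its limit to risk an OOMKill."""
    return limit_bytes > 0 and usage_bytes / limit_bytes >= threshold

# Example readings, e.g. parsed from `docker stats --no-stream`
containers = {
    "api":    (900 * 2**20, 1024 * 2**20),   # 900 MiB of a 1 GiB limit
    "worker": (200 * 2**20, 1024 * 2**20),
}

at_risk = [name for name, (used, limit) in containers.items()
           if memory_pressure(used, limit)]
# at_risk == ["api"]
```

A container with no limit configured (`limit_bytes == 0`) is skipped rather than flagged, since there is no meaningful percentage to compute; that case is the "Ignoring Resource Limits" pitfall below.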
Application-Level Health
- HTTP health checks — Not just "is the port open" but "does the app respond correctly"
- Readiness checks — Can the container actually serve traffic? (Database connected, cache warmed, etc.)
- Custom health endpoints — Return detailed status including dependency health
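One way to structure a "detailed status" endpoint is to run each dependency probe and fold the results into a single status code. This sketch uses hypothetical probe names; real probes would ping the database, cache, and so on:

```python
from typing import Callable

def health_status(checks: dict[str, Callable[[], bool]]) -> tuple[int, dict]:
    """Run every dependency probe; return 200 only if all of them pass."""
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = "ok" if probe() else "fail"
        except Exception:
            # A probe that crashes counts as a failure, not a 500
            results[name] = "fail"
    code = 200 if all(v == "ok" for v in results.values()) else 503
    return code, results

# Hypothetical probes standing in for real connectivity checks
code, detail = health_status({
    "database": lambda: True,
    "cache": lambda: False,
})
# code == 503 because the cache probe failed
```

Returning 503 with the per-dependency detail in the body gives both the load balancer and a human operator something actionable.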
Orchestrator-Level Monitoring
- Pod/task status — Running, pending, failed, evicted
- Deployment rollout status — Is the deployment stuck?
- Resource pressure — Are nodes running out of CPU, memory, or disk?
- Scheduling failures — Can new containers be placed?
Practical Monitoring Setup
Layer 1: External HTTP Monitoring
Monitor your containerized services from outside your cluster:
- HTTP checks on exposed endpoints
- Keyword validation to confirm correct responses
- Response time tracking
This is your user-perspective view. If this is green, users are happy regardless of what's happening internally.
Layer 2: Container Health Checks
Configure proper health checks in your container definitions:
Docker Compose:
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s
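If you build your own images, the same check can be baked into the Dockerfile so it travels with the image (port and path assumed to match the Compose example; `curl` must exist in the image):

```dockerfile
HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --retries=3 \
  CMD curl -f http://localhost:3000/health || exit 1
```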
Layer 3: Log Monitoring
Containerized applications should log to stdout/stderr. Aggregate these logs and monitor for:
- Error rate spikes
- OOMKill messages
- Connection refused errors
- Timeout patterns
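Error-rate spike detection reduces to counting matching log lines per window and comparing against a baseline. A minimal sketch, with an illustrative pattern and thresholds:

```python
import re

# Illustrative pattern; tune to your application's log format
ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL|OOMKilled|Connection refused)\b")

def error_rate(lines: list[str]) -> float:
    """Fraction of log lines in this window that match an error pattern."""
    if not lines:
        return 0.0
    hits = sum(1 for line in lines if ERROR_PATTERN.search(line))
    return hits / len(lines)

def spiking(current: float, baseline: float, factor: float = 3.0) -> bool:
    """Alert when the current window's rate is well above the baseline."""
    return current > baseline * factor

window = [
    "INFO request handled in 12ms",
    "ERROR upstream timeout",
    "INFO request handled in 9ms",
    "ERROR Connection refused by db:5432",
]
# error_rate(window) == 0.5, which spikes against a 5% baseline
```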
Layer 4: Restart Monitoring
Track container restart counts. Alert when:
- Any container restarts more than 3 times in 10 minutes
- Total cluster restarts exceed normal baseline
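The counter itself comes from `docker inspect` (`RestartCount`) or the orchestrator's API; the first alert rule above is a sliding-window count over restart timestamps. A sketch with illustrative timestamps in seconds:

```python
def restart_alert(restart_times: list[float],
                  window_s: float = 600,
                  max_restarts: int = 3) -> bool:
    """True when more than max_restarts restarts fall inside one window."""
    if not restart_times:
        return False
    times = sorted(restart_times)
    lo = 0
    for hi, t in enumerate(times):
        # Slide the window's left edge forward until it spans <= window_s
        while t - times[lo] > window_s:
            lo += 1
        if hi - lo + 1 > max_restarts:
            return True
    return False

restart_alert([0, 100, 200, 300])   # 4 restarts in 10 minutes -> alert
```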
Common Pitfalls
Health Checks That Lie
A health check that only returns 200 without actually testing the application is worse than no health check — it provides false confidence. Test real functionality: database connectivity, cache access, critical dependencies.
Monitoring Only the Load Balancer
The load balancer says all backends are healthy. But it's only checking TCP port availability, not application health. Add HTTP-level health checks with content validation.
Ignoring Resource Limits
Running without memory/CPU limits means one runaway container can starve others. Set limits AND monitor usage relative to those limits.
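In Compose, that means explicit per-service limits. The values here are illustrative; this is the `deploy.resources` form:

```yaml
services:
  api:
    deploy:
      resources:
        limits:
          cpus: "0.50"      # at most half a CPU core
          memory: 256M      # exceeding this triggers an OOMKill
```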
The Minimum Container Monitoring Setup
- External HTTP monitoring on all exposed services (30-second intervals)
- Proper health checks in every container definition
- Restart count monitoring with alerts
- Memory usage monitoring (alert at 85% of limit)
- Log aggregation with error rate tracking
Containers make deployment easy. Don't let them make monitoring hard.
Written by the UptimeGuard Team