Debugging Intermittent 502 Errors: A War Story
It took us three weeks and a packet capture to find out why our load balancer returned 502s every Tuesday shortly after 3 AM. This is the full timeline.
Week One: The Mystery Begins
The alert came in at 3:12 AM on a Tuesday. Two of our monitors reported 502 errors for about 90 seconds, then everything recovered. No deployment had happened. No config changes. The logs showed nothing unusual.
We chalked it up to a transient network issue and went back to sleep.
The following Tuesday, same thing. 3:14 AM. Same two monitors. Same 90-second window.
Week Two: Down the Rabbit Hole
Now we were paying attention. We added more monitors, decreased check intervals to 10 seconds, and set up a packet capture on the load balancer.
What we found:
- The 502s only happened on connections to two specific backend servers
- Those servers were healthy — they responded fine to direct requests
- The load balancer health checks never flagged them
We checked cron jobs. Nothing ran at 3 AM. We checked log rotation — that happened at midnight, not 3 AM. We checked certificate renewal — nope.
Then someone asked: "What about the database backup?"
The Root Cause
Our database backup ran at 2:45 AM. The backup itself did not break anything, but it triggered a spike in I/O that caused our connection pooler to briefly exhaust its file descriptors. The pooler did not crash; it just stopped accepting new connections for about 60 to 90 seconds.
The two backend servers that shared a connection pooler instance were the ones returning 502s. The load balancer health check used a persistent connection that was already established, so it never noticed.
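To make the masking effect concrete, here is a minimal sketch of a probe that opens a brand-new connection for every check, so it exercises the same accept path a real client would hit. It is not our production health check: the function name and the `/healthz` path are hypothetical, and it assumes the backend exposes an HTTP endpoint whose handler actually goes through the connection pooler. It sends a full request rather than stopping at the TCP handshake, because the kernel can complete a handshake from the listen backlog even when the process has stopped accepting.

```python
import http.client


def probe_with_new_connection(host: str, port: int,
                              path: str = "/healthz",  # hypothetical endpoint
                              timeout: float = 2.0) -> bool:
    """Open a new connection, make one request, and close it again."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        # "Connection: close" asks the server not to keep this connection
        # alive, so the next probe has to be accepted from scratch.
        conn.request("GET", path, headers={"Connection": "close"})
        response = conn.getresponse()
        response.read()
        return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Refused connections, timeouts, and dropped responses all count
        # as a failed check.
        return False
    finally:
        conn.close()
```

A check that reuses one long-lived connection skips the connect and accept steps entirely, which is exactly the part that was failing.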
The Fix
Three changes:
- Increased file descriptor limits on the connection pooler from 1024 to 65535
- Changed health checks to create new connections instead of reusing persistent ones
- Added a monitor specifically for the connection pooler's file descriptor usage
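The third change can be approximated with a few lines of Python. This is a sketch under stated assumptions, not the monitor we deployed: it assumes a Linux host where `/proc/<pid>/fd` and `/proc/<pid>/limits` are readable by the monitoring user, and the function names and the 0.8 alert threshold are illustrative choices.

```python
import os
from typing import Optional


def parse_soft_nofile_limit(limits_text: str) -> Optional[int]:
    """Extract the soft 'Max open files' limit from /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: "Max open files" <soft> <hard> <units>
            soft = line.split()[3]
            return None if soft == "unlimited" else int(soft)
    return None


def fd_usage_ratio(pid: int) -> Optional[float]:
    """Fraction of the soft fd limit a process is using (Linux only)."""
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    with open(f"/proc/{pid}/limits") as f:
        limit = parse_soft_nofile_limit(f.read())
    return open_fds / limit if limit else None


def should_alert(ratio: Optional[float], threshold: float = 0.8) -> bool:
    """Alert once usage crosses the threshold (0.8 is an assumed default)."""
    return ratio is not None and ratio >= threshold
```

Alerting on a ratio instead of an absolute count means the monitor keeps working after the limit is raised from 1024 to 65535.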
Total cost of the investigation: about 20 engineering hours across three weeks.
Lessons Learned
- Intermittent issues are almost never random. If it happens on a schedule, something else runs on that schedule.
- Health checks that use persistent connections can mask failures. Always test the full connection path.
- Monitor your monitoring. If your health check cannot catch a failure mode, it is not checking the right thing.
- Low-level metrics matter. File descriptors, connection pool sizes, kernel parameters — these are the things that bite you at 3 AM.
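The first lesson can be turned into a quick check: given an incident timestamp, list the scheduled jobs that started shortly before it. The sketch below is illustrative only. The job names, the date, and the assumption that each job runs daily at a fixed `HH:MM` are placeholders; only the midnight log rotation and the 2:45 AM backup times come from the timeline above.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def jobs_started_before(incident: datetime,
                        daily_jobs: Dict[str, str],
                        window: timedelta) -> List[str]:
    """Return jobs whose most recent start falls within `window` before the incident."""
    hits = []
    for name, hhmm in daily_jobs.items():
        hour, minute = map(int, hhmm.split(":"))
        start = incident.replace(hour=hour, minute=minute, second=0, microsecond=0)
        if start > incident:
            # The job has not run yet today, so its latest run was yesterday.
            start -= timedelta(days=1)
        if incident - start <= window:
            hits.append(name)
    return hits


# Hypothetical schedule and incident time (the date is a placeholder).
schedule = {"log_rotation": "00:00", "db_backup": "02:45"}
incident = datetime(2024, 1, 2, 3, 12)
print(jobs_started_before(incident, schedule, timedelta(minutes=30)))
# prints ['db_backup']
```

Widening the window is cheap, so it is worth starting generous: a job that begins 27 minutes before the symptom is easy to dismiss if you only look at what runs at exactly 3 AM.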
Written by
Marcus Johnson
SRE Lead. 10+ years in infrastructure and reliability.