Debugging Intermittent 502 Errors: A War Story

Week One: The Mystery Begins

The alert came in at 3:12 AM on a Tuesday. Two of our monitors reported 502 errors for about 90 seconds, then everything recovered. No deployment had happened. No config changes. The logs showed nothing unusual.

We chalked it up to a transient network issue and went back to sleep.

The following Tuesday, same thing. 3:14 AM. Same two monitors. Same 90-second window.

Week Two: Down the Rabbit Hole

Now we were paying attention. We added more monitors, decreased check intervals to 10 seconds, and set up a packet capture on the load balancer.

What we found:

The 502s only happened on connections to two specific backend servers
Those servers were healthy — they responded fine to direct requests
The load balancer health checks never flagged them

We checked cron jobs. Nothing ran at 3 AM. We checked log rotation — that happened at midnight, not 3 AM. We checked certificate renewal — nope.

Then someone asked: "What about the database backup?"

The Root Cause

Our database backup ran at 2:45 AM. It did not cause any issues directly. But it triggered a spike in I/O that caused our connection pooler to briefly max out its file descriptors. The pooler did not crash — it just stopped accepting new connections for about 60-90 seconds.

The two backend servers that shared a connection pooler instance were the ones returning 502s. The load balancer health check used a persistent connection that was already established, so it never noticed.

The Fix

Three changes:

Increased file descriptor limits on the connection pooler from 1024 to 65535
Changed health checks to create new connections instead of reusing persistent ones
Added a monitor specifically for the connection pooler's file descriptor usage

Total cost of the investigation: about 20 engineering hours across three weeks.

Lessons Learned

Intermittent issues are almost never random. If it happens on a schedule, something else runs on that schedule.
Health checks that use persistent connections can mask failures. Always test the full connection path.
Monitor your monitoring. If your health check cannot catch a failure mode, it is not checking the right thing.
Low-level metrics matter. File descriptors, connection pool sizes, kernel parameters — these are the things that bite you at 3 AM.