Incident Retrospective: Our Worst Outage and What We Learned
Complete transparency about our longest outage — the timeline, the root cause, what failed, and the 14 changes we made to ensure it never happens again.
We believe in practicing what we preach. When we experienced our longest outage, we committed to full transparency — not just with our customers, but with the broader community.
This is our most painful incident, dissected completely.
The Incident
- Date: A Thursday in late Q3
- Duration: 2 hours 14 minutes
- Impact: Monitoring checks delayed; some alerts not delivered for 47 minutes
- Root Cause: Database migration during routine maintenance caused unexpected lock contention
Timeline
- 14:00 — Planned maintenance begins: database schema migration to improve query performance
- 14:08 — Migration triggers unexpected table-level lock on the checks table
- 14:09 — Check execution queue begins backing up
- 14:11 — Our own monitoring detects the queue backup (yes, we monitor our monitoring)
- 14:12 — Alert fires to the engineering team
- 14:15 — Team assembled, begins investigation
- 14:22 — Root cause identified: migration holding exclusive lock
- 14:25 — Decision: can't kill migration (risk of corruption), must wait for completion
- 14:30 — Status page updated: "Monitoring checks may be delayed"
- 14:45 — Customer notification sent via email
- 15:37 — Migration completes, locks released
- 15:42 — Check queue fully caught up
- 15:50 — All systems verified healthy
- 16:00 — Status page updated: "Resolved"
- 16:14 — Detailed update sent to all customers
What Went Wrong
- Migration wasn't tested at production scale. Our staging database had 10% of production data. The migration that took 3 minutes in staging took 89 minutes in production.
- We didn't anticipate the locking behavior. The migration used a plain ALTER TABLE, which acquired an exclusive table lock for the duration of the operation.
- No kill switch for the migration. Once it started, we couldn't safely abort without risking data corruption.
- Alert delivery was affected. Because check execution was delayed, some alerts for customer sites that went down during the incident were also delayed.
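A batched, short-transaction backfill avoids the long exclusive lock that bit us here: adding a nullable column is cheap, and the expensive data backfill then runs in small batches that each commit and release their locks quickly. Here's a minimal sketch using Python's built-in sqlite3 — the table, column name, and batch size are hypothetical, not our actual schema:

```python
import sqlite3

# Hypothetical example: backfill a new column in small batches instead of one
# long ALTER/UPDATE, so no single transaction holds locks for the full run.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE checks (id INTEGER PRIMARY KEY, url TEXT)")
conn.executemany("INSERT INTO checks (url) VALUES (?)",
                 [(f"https://site-{i}.example",) for i in range(1000)])

# Adding a nullable column is cheap; the expensive part is the backfill.
conn.execute("ALTER TABLE checks ADD COLUMN region TEXT")

BATCH = 100
last_id = 0
while True:
    # Each batch is its own short transaction; locks release between batches.
    with conn:
        rows = conn.execute(
            "SELECT id FROM checks WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, BATCH)).fetchall()
        if not rows:
            break
        ids = [r[0] for r in rows]
        conn.executemany("UPDATE checks SET region = 'us-east' WHERE id = ?",
                         [(i,) for i in ids])
        last_id = ids[-1]

remaining = conn.execute(
    "SELECT COUNT(*) FROM checks WHERE region IS NULL").fetchone()[0]
print(remaining)  # 0
```

In a real Postgres or MySQL deployment you'd pair this with online DDL tooling and lock timeouts; the point of the sketch is only the shape of the loop.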
What Went Right
- Self-monitoring caught it fast. We detected the problem within 2 minutes.
- Communication was transparent. Status page updated within 18 minutes of the alert firing, customer email within 33 minutes.
- No data loss. Despite the lock contention, no monitoring data was lost — just delayed.
- Team response was fast and coordinated. Engineers assembled within 3 minutes of the alert and identified the root cause 10 minutes after it fired.
The 14 Changes We Made
- All migrations must be tested against a production-sized dataset
- Migrations must use online DDL tools (no exclusive table locks)
- Migration rollback plans required and tested before execution
- Maintenance windows require explicit "go/no-go" checklist
- Alert delivery system separated from check execution system
- Added secondary alert delivery path (redundancy)
- Database migration runbook created and documented
- Added queue depth monitoring with separate alerting
- Implemented circuit breaker on check execution
- Added automated customer notification for extended incidents
- Created "migration staging" environment with production-scale data
- Added lock monitoring to database health checks
- Implemented gradual migration execution (batched operations)
- Monthly game day testing of alert delivery during degraded state
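Several of these changes are standard reliability patterns. The circuit breaker on check execution, for example, can be sketched roughly like this — the class, thresholds, and state handling are illustrative, not our production implementation:

```python
import time

# Minimal circuit-breaker sketch: after `max_failures` consecutive errors the
# breaker opens and rejects calls for `reset_after` seconds, then allows one
# trial call (half-open). All names and thresholds here are illustrative.
class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: call rejected")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the breaker fully
        return result

breaker = CircuitBreaker(max_failures=2, reset_after=60)

def flaky():
    raise IOError("backend unavailable")

for _ in range(2):
    try:
        breaker.call(flaky)
    except IOError:
        pass

# Breaker is now open; further calls fail fast without touching the backend.
try:
    breaker.call(flaky)
except RuntimeError as e:
    print(e)  # circuit open: call rejected
```

The value during an incident like ours is the fail-fast path: when the check execution backend is degraded, callers get an immediate error instead of piling more work onto an already-backed-up queue.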
The Takeaway
We build a monitoring product that helps teams catch and resolve incidents quickly. This incident reminded us that we're not immune to the same challenges our customers face.
The best response to a failure isn't perfection — it's learning. Every one of these 14 changes makes our platform more reliable for the customers who trust us with their monitoring.
We're sorry for the disruption. And we're committed to earning that trust back through action, not just words.
Written by the UptimeGuard Team