
Incident Retrospective: Our Worst Outage and What We Learned

Complete transparency about our longest outage — the timeline, the root cause, what failed, and the 14 changes we made to ensure it never happens again.

UptimeGuard Team
March 2, 2026 · 9 min read · 8,468 views
Tags: post-mortem, transparency, incident, database, lessons-learned

We believe in practicing what we preach. When we experienced our longest outage, we committed to full transparency — not just with our customers, but with the broader community.

This is our most painful incident, dissected completely.

The Incident

Date: A Thursday in late Q3
Duration: 2 hours 14 minutes
Impact: Monitoring checks delayed; some alerts not delivered for 47 minutes
Root Cause: Database migration during routine maintenance caused unexpected lock contention

Timeline

  • 14:00 — Planned maintenance begins: database schema migration to improve query performance
  • 14:08 — Migration triggers unexpected table-level lock on the checks table
  • 14:09 — Check execution queue begins backing up
  • 14:11 — Our own monitoring detects the queue backup (yes, we monitor our monitoring)
  • 14:12 — Alert fires to the engineering team
  • 14:15 — Team assembled, begins investigation
  • 14:22 — Root cause identified: migration holding exclusive lock
  • 14:25 — Decision: can't kill migration (risk of corruption), must wait for completion
  • 14:30 — Status page updated: "Monitoring checks may be delayed"
  • 14:45 — Customer notification sent via email
  • 15:37 — Migration completes, locks released
  • 15:42 — Check queue fully caught up
  • 15:50 — All systems verified healthy
  • 16:00 — Status page updated: "Resolved"
  • 16:14 — Detailed update sent to all customers
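The detection at 14:11 came from queue-depth monitoring on our own check pipeline. A minimal sketch of that kind of watchdog, in plain Python with illustrative names and thresholds (not our production code):

```python
import time

# Hypothetical threshold: alert once the check queue holds more backlog
# than normal throughput can absorb quickly.
MAX_QUEUE_DEPTH = 5_000

def queue_backed_up(depth: int, threshold: int = MAX_QUEUE_DEPTH) -> bool:
    """Return True when the queue depth breaches the alert threshold."""
    return depth > threshold

def watchdog(sample_depth, alert, threshold=MAX_QUEUE_DEPTH, interval_s=10):
    """Poll queue depth and fire an alert the moment it backs up.

    `sample_depth` returns the current queue depth; `alert` delivers
    a message. Returns after the first alert so a human takes over.
    """
    while True:
        depth = sample_depth()
        if queue_backed_up(depth, threshold):
            alert(f"check queue backed up: depth={depth}")
            return
        time.sleep(interval_s)
```

The key property is that the watchdog lives outside the check-execution path, so a stalled queue cannot stall its own alarm.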

What Went Wrong

  1. Migration wasn't tested at production scale. Our staging database had 10% of production data. The migration that took 3 minutes in staging took 89 minutes in production.

  2. We didn't anticipate the locking behavior. The migration used a plain ALTER TABLE, which acquired an exclusive lock on the checks table for the entire run.

  3. No kill switch for the migration. Once started, we couldn't safely abort without risking data corruption.

  4. Alert delivery was affected. Because our check execution was delayed, some alerts for customer sites that went down during the incident were also delayed.
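One mitigation for the locking failure above is to bound how long a DDL statement may wait for its lock, so it fails fast and retries instead of queueing behind (and blocking) normal traffic. A hedged sketch, assuming a Postgres-style `lock_timeout` session setting and any DB-API-style execute callable; the statement and timeout values are illustrative:

```python
import time

def run_with_lock_timeout(execute, ddl: str,
                          lock_timeout_ms: int = 2000,
                          retries: int = 5,
                          backoff_s: float = 1.0) -> bool:
    """Attempt a DDL statement with a bounded lock wait.

    `execute` is any callable that runs SQL (e.g. cursor.execute) and
    raises if the lock timeout expires. Returns True on success, False
    once all retries are exhausted.
    """
    for attempt in range(retries):
        try:
            # Give up on the lock quickly rather than stalling the queue.
            execute(f"SET lock_timeout = '{lock_timeout_ms}ms'")
            execute(ddl)
            return True
        except Exception:
            # Back off and retry; a later attempt may find the table idle.
            time.sleep(backoff_s * (attempt + 1))
    return False
```

This turns "migration silently blocks everything for 89 minutes" into "migration fails loudly after 2 seconds," which is a far easier incident to manage.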

What Went Right

  1. Self-monitoring caught it fast. We detected the problem within 2 minutes.
  2. Communication was transparent. Status page updated within 18 minutes, customer email within 33 minutes.
  3. No data loss. Despite the lock contention, no monitoring data was lost — just delayed.
  4. Team response was fast and coordinated.

The 14 Changes We Made

  1. All migrations must be tested against a production-sized dataset
  2. Migrations must use online DDL tools (no exclusive table locks)
  3. Migration rollback plans required and tested before execution
  4. Maintenance windows require explicit "go/no-go" checklist
  5. Alert delivery system separated from check execution system
  6. Added secondary alert delivery path (redundancy)
  7. Database migration runbook created and documented
  8. Added queue depth monitoring with separate alerting
  9. Implemented circuit breaker on check execution
  10. Added automated customer notification for extended incidents
  11. Created "migration staging" environment with production-scale data
  12. Added lock monitoring to database health checks
  13. Implemented gradual migration execution (batched operations)
  14. Monthly game day testing of alert delivery during degraded state
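Change 13 above — batched operations — can be sketched simply: instead of one long-running statement that holds a lock for its full duration, split the work into small primary-key ranges so each statement finishes (and releases its locks) quickly. Table and column names here are hypothetical:

```python
from typing import Iterator, List, Tuple

def id_batches(min_id: int, max_id: int,
               batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) primary-key ranges for a batched backfill."""
    start = min_id
    while start <= max_id:
        end = min(start + batch_size - 1, max_id)
        yield (start, end)
        start = end + 1

def backfill_statements(table: str, column: str, min_id: int,
                        max_id: int, batch_size: int) -> List[str]:
    """Build one small UPDATE per batch instead of a single long-running
    statement, so locks are held briefly and normal queries interleave."""
    return [
        f"UPDATE {table} SET {column} = DEFAULT "
        f"WHERE id BETWEEN {lo} AND {hi}"
        for lo, hi in id_batches(min_id, max_id, batch_size)
    ]
```

Each batch can also be followed by a short pause and a replication-lag check before the next one runs, which is the "gradual execution" part of the change.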

The Takeaway

We build a monitoring product that helps teams catch and resolve incidents quickly. This incident reminded us that we're not immune to the same challenges our customers face.

The best response to a failure isn't perfection — it's learning. Every one of these 14 changes makes our platform more reliable for the customers who trust us with their monitoring.

We're sorry for the disruption. And we're committed to earning that trust back through action, not just words.
