Incident Retrospective: Our Worst Outage and What We Learned
Complete transparency about our longest outage — the timeline, the root cause, what failed, and the 14 changes we made to ensure it never happens again.
We believe in practicing what we preach. When we experienced our longest outage, we committed to full transparency — not just with our customers, but with the broader community.
This is our most painful incident, dissected completely.
The Incident
- Date: A Thursday in late Q3
- Duration: 2 hours 14 minutes
- Impact: Monitoring checks delayed; some alerts not delivered for 47 minutes
- Root Cause: Database migration during routine maintenance caused unexpected lock contention
Timeline
- 14:00 — Planned maintenance begins: database schema migration to improve query performance
- 14:08 — Migration triggers unexpected table-level lock on the checks table
- 14:09 — Check execution queue begins backing up
- 14:11 — Our own monitoring detects the queue backup (yes, we monitor our monitoring)
- 14:12 — Alert fires to the engineering team
- 14:15 — Team assembled, begins investigation
- 14:22 — Root cause identified: migration holding exclusive lock
- 14:25 — Decision: can't kill migration (risk of corruption), must wait for completion
- 14:30 — Status page updated: "Monitoring checks may be delayed"
- 14:45 — Customer notification sent via email
- 15:37 — Migration completes, locks released
- 15:42 — Check queue fully caught up
- 15:50 — All systems verified healthy
- 16:00 — Status page updated: "Resolved"
- 16:14 — Detailed update sent to all customers
What Went Wrong
- Migration wasn't tested at production scale. Our staging database had 10% of production data. The migration that took 3 minutes in staging took 89 minutes in production.
- We didn't anticipate the locking behavior. The migration used a plain ALTER TABLE, which acquired an exclusive table lock for the duration of the operation.
- No kill switch for the migration. Once it started, we couldn't safely abort without risking data corruption.
- Alert delivery was affected. Because check execution was delayed, some alerts for customer sites that went down during the incident were also delayed.
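A batched, short-transaction backfill avoids the long exclusive lock that bit us here: adding a nullable column is cheap, and the expensive data backfill then runs in small batches that each commit and release their locks quickly. Here's a minimal sketch using Python's built-in sqlite3 — the table, column name, and batch size are hypothetical, not our actual schema:

```python
import sqlite3

# Hypothetical example: backfill a new column in small batches instead of one
# long ALTER/UPDATE, so no single transaction holds locks for the full run.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE checks (id INTEGER PRIMARY KEY, url TEXT)")
conn.executemany("INSERT INTO checks (url) VALUES (?)",
                 [(f"https://site-{i}.example",) for i in range(1000)])

# Adding a nullable column is cheap; the expensive part is the backfill.
conn.execute("ALTER TABLE checks ADD COLUMN region TEXT")

BATCH = 100
last_id = 0
while True:
    # Each batch is its own short transaction; locks release between batches.
    with conn:
        rows = conn.execute(
            "SELECT id FROM checks WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, BATCH)).fetchall()
        if not rows:
            break
        ids = [r[0] for r in rows]
        conn.executemany("UPDATE checks SET region = 'us-east' WHERE id = ?",
                         [(i,) for i in ids])
        last_id = ids[-1]

remaining = conn.execute(
    "SELECT COUNT(*) FROM checks WHERE region IS NULL").fetchone()[0]
print(remaining)  # 0
```

In a real Postgres or MySQL deployment you'd pair this with online DDL tooling and lock timeouts; the point of the sketch is only the shape of the loop.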
What Went Right
- Self-monitoring caught it fast. We detected the problem within 2 minutes.
- Communication was transparent. Status page updated within 18 minutes of the alert firing, customer email within 33 minutes.
- No data loss. Despite the lock contention, no monitoring data was lost — just delayed.
- Team response was fast and coordinated. Engineers assembled within 3 minutes of the alert and identified the root cause 10 minutes after it fired.
The 14 Changes We Made
- All migrations must be tested against a production-sized dataset
- Migrations must use online DDL tools (no exclusive table locks)
- Migration rollback plans required and tested before execution
- Maintenance windows require explicit "go/no-go" checklist
- Alert delivery system separated from check execution system
- Added secondary alert delivery path (redundancy)
- Database migration runbook created and documented
- Added queue depth monitoring with separate alerting
- Implemented circuit breaker on check execution
- Added automated customer notification for extended incidents
- Created "migration staging" environment with production-scale data
- Added lock monitoring to database health checks
- Implemented gradual migration execution (batched operations)
- Monthly game day testing of alert delivery during degraded state
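Several of these changes are standard reliability patterns. The circuit breaker on check execution, for example, can be sketched roughly like this — the class, thresholds, and state handling are illustrative, not our production implementation:

```python
import time

# Minimal circuit-breaker sketch: after `max_failures` consecutive errors the
# breaker opens and rejects calls for `reset_after` seconds, then allows one
# trial call (half-open). All names and thresholds here are illustrative.
class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: call rejected")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the breaker fully
        return result

breaker = CircuitBreaker(max_failures=2, reset_after=60)

def flaky():
    raise IOError("backend unavailable")

for _ in range(2):
    try:
        breaker.call(flaky)
    except IOError:
        pass

# Breaker is now open; further calls fail fast without touching the backend.
try:
    breaker.call(flaky)
except RuntimeError as e:
    print(e)  # circuit open: call rejected
```

The value during an incident like ours is the fail-fast path: when the check execution backend is degraded, callers get an immediate error instead of piling more work onto an already-backed-up queue.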
The Takeaway
We build a monitoring product that helps teams catch and resolve incidents quickly. This incident reminded us that we're not immune to the same challenges our customers face.
The best response to a failure isn't perfection — it's learning. Every one of these 14 changes makes our platform more reliable for the customers who trust us with their monitoring.
We're sorry for the disruption. And we're committed to earning that trust back through action, not just words.
Written by the UptimeGuard Team