How to Build an Effective On-Call Runbook
A good runbook turns a panicked 3 AM incident into a calm, step-by-step resolution. Here's how to write runbooks your team will actually use.
At 3 AM, your on-call engineer gets paged. They're groggy, stressed, and working with half their usual cognitive capacity. This is not the time for problem-solving from scratch.
A runbook gives them a clear, step-by-step path from alert to resolution.
What Makes a Good Runbook
Specific, Not General
Bad: "Check the database"
Good: "Run SELECT count(*) FROM pg_stat_activity WHERE state = 'active' — if count > 90, proceed to Step 3"
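One way to see what makes the "Good" version specific is to note that it has three parts: a command, a threshold, and a next action. The sketch below models that shape. The `ThresholdStep` class and the "continue to the next step" fallback are illustrative assumptions, not something the example above defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdStep:
    """One runbook step: a command to run, a threshold, and what to do on each outcome."""
    command: str
    threshold: int
    if_above: str
    if_not_above: str

    def next_action(self, observed: int) -> str:
        # Strictly greater than, matching "if count > 90".
        return self.if_above if observed > self.threshold else self.if_not_above


step = ThresholdStep(
    command="SELECT count(*) FROM pg_stat_activity WHERE state = 'active'",
    threshold=90,
    if_above="proceed to Step 3",
    if_not_above="continue to the next step",  # assumption: not stated in the example
)

print(step.command)
print(step.next_action(120))
```

If a step cannot be written in this shape, it is probably still too general.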
Sequenced by Probability
List the most likely cause first. Don't make the engineer work through 10 rare scenarios before reaching the one that happens 80% of the time.
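If you record a root cause for each past incident, the ordering can come from data rather than memory. A minimal sketch, assuming a hypothetical list of recorded root causes (the cause names are made up for illustration):

```python
from collections import Counter

# Hypothetical incident history: the root cause recorded for each past page.
past_root_causes = [
    "connection pool exhausted",
    "disk full",
    "connection pool exhausted",
    "bad deploy",
    "connection pool exhausted",
]


def order_causes(history):
    """Return (cause, share of incidents) pairs, most frequent cause first."""
    counts = Counter(history)
    total = sum(counts.values())
    return [(cause, count / total) for cause, count in counts.most_common()]


for cause, share in order_causes(past_root_causes):
    print(f"{cause}: {share:.0%} of cases")
```

The resulting order is the order in which the causes should appear in the runbook.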
Copy-Pasteable Commands
Every diagnostic or fix command should be ready to copy and paste. No one should be constructing complex commands at 3 AM.
Includes Escalation Criteria
Define when to stop following the runbook and escalate to someone else, for example after a fixed time limit without a resolution.
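A time box is the simplest escalation criterion to state unambiguously. The sketch below shows one way to express it; the 30-minute limit is an assumed placeholder, and treating "exactly at the limit" as time to escalate is a design choice, not a rule.

```python
from datetime import datetime, timedelta


def should_escalate(started_at: datetime, now: datetime, limit_minutes: int) -> bool:
    """Return True once the time box for working the runbook alone is used up."""
    return now - started_at >= timedelta(minutes=limit_minutes)


paged_at = datetime(2024, 1, 1, 3, 0)  # hypothetical page time

print(should_escalate(paged_at, datetime(2024, 1, 1, 3, 20), limit_minutes=30))  # False
print(should_escalate(paged_at, datetime(2024, 1, 1, 3, 30), limit_minutes=30))  # True
```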
Runbook Template
Title: [Alert Name] Runbook
Last Updated: [Date]
Author: [Name]
Alert: [What triggers this runbook]
Severity: [P1/P2/P3]
Quick Assessment (2 minutes)
- Check the monitoring dashboard: [link]
- Verify the alert is real (not a false positive)
- Check if there was a recent deployment: [link to deploy log]
- Determine severity and communicate in #incidents
Most Likely Cause: [Cause A] (70% of cases)
Symptoms: [What you'll see]
Diagnosis: [Command to confirm]
Fix: [Exact steps to resolve]
Verify: [How to confirm it's fixed]
Second Most Likely: [Cause B] (20% of cases)
Symptoms: [What you'll see]
Diagnosis: [Command to confirm]
Fix: [Exact steps to resolve]
Verify: [How to confirm it's fixed]
Less Common: [Cause C] (10% of cases)
Symptoms: [What you'll see]
Diagnosis: [Command to confirm]
Fix: [Exact steps to resolve]
Verify: [How to confirm it's fixed]
Escalation
If none of the above resolves the issue within [X minutes], escalate:
- Primary: [Name, phone number]
- Secondary: [Name, phone number]
- Management: [Name, phone number]
Post-Resolution
- Verify monitoring shows recovery
- Update status page
- Post summary in #incidents
- Schedule post-mortem if P1/P2
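A template only helps once its placeholders are filled in. As a rough sketch, a small check like the one below could flag cause sections from the template above that are still unfinished. It assumes each cause section is stored as a dictionary keyed by the four field names, which is a hypothetical storage format.

```python
import re

REQUIRED_FIELDS = ("Symptoms", "Diagnosis", "Fix", "Verify")

# A value that is nothing but "[...]" is treated as an unfilled template placeholder.
PLACEHOLDER = re.compile(r"\[[^\]]*\]")


def unfinished_fields(cause_section: dict) -> list:
    """Return the required fields that are missing, empty, or still a placeholder."""
    problems = []
    for name in REQUIRED_FIELDS:
        value = cause_section.get(name, "").strip()
        if not value or PLACEHOLDER.fullmatch(value):
            problems.append(name)
    return problems


# Hypothetical draft with one field filled in.
draft = {
    "Symptoms": "API latency above 2 seconds on the dashboard",
    "Diagnosis": "[Command to confirm]",
    "Fix": "",
}

print(unfinished_fields(draft))  # ['Diagnosis', 'Fix', 'Verify']
```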
Keeping Runbooks Alive
After Every Incident
- Did the runbook help? What was missing?
- Update the runbook with any new information
- Add new causes discovered during the incident
Monthly Review
- Are all runbooks still accurate?
- Have services or commands changed?
- Are escalation contacts current?
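The monthly review is easier to keep up if something lists the runbooks that are overdue. A minimal sketch, assuming you can collect each runbook's "Last Updated" date into a dictionary (the runbook names, dates, and the 30-day cutoff are illustrative):

```python
from datetime import date, timedelta


def stale_runbooks(last_updated: dict, today: date, max_age_days: int = 30) -> list:
    """Return the names of runbooks not updated within max_age_days, sorted by name."""
    limit = timedelta(days=max_age_days)
    return sorted(
        name for name, updated in last_updated.items() if today - updated > limit
    )


# Hypothetical runbooks and their "Last Updated" dates.
runbooks = {
    "High API Latency": date(2024, 5, 20),
    "Database Connections Exhausted": date(2024, 1, 15),
}

print(stale_runbooks(runbooks, today=date(2024, 6, 1)))
# ['Database Connections Exhausted']
```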
Make Them Findable
- Link runbooks directly from alert messages
- Pin them in relevant Slack channels
- Store them in a searchable, always-accessible location (not behind a VPN, if possible)
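Linking from the alert itself can be as simple as appending the runbook URL when the alert text is built. The sketch below assumes alert messages are plain strings; the alert text and URL are hypothetical, and most alerting tools have their own field or annotation for this.

```python
def with_runbook_link(alert_message: str, runbook_url: str) -> str:
    """Append the runbook link on its own line so it is easy to spot and click."""
    return f"{alert_message}\nRunbook: {runbook_url}"


# Hypothetical alert text and URL, for illustration only.
message = with_runbook_link(
    "[P2] API latency above threshold",
    "https://runbooks.example.com/api-latency",
)

print(message)
```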
The best runbook is one you hope to never need but are grateful to have at 3 AM.
Written by UptimeGuard Team