How to Build an Effective On-Call Runbook

At 3 AM, your on-call engineer gets paged. They're groggy, stressed, and working with half their usual cognitive capacity. This is not the time for problem-solving from scratch.

A runbook gives them a clear, step-by-step path from alert to resolution.

What Makes a Good Runbook

Specific, Not General

Bad: "Check the database"
Good: "Run SELECT count(*) FROM pg_stat_activity WHERE state = 'active'. If the count is above 90, proceed to Step 3."
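The same idea can be sketched in Python: a specific instruction pairs a measured value with a threshold and a named next step. This is only an illustration. The function name `next_step` is hypothetical, the limit of 90 is taken from the example, and the count is assumed to have been fetched already.

```python
ACTIVE_CONNECTION_LIMIT = 90  # threshold from the example; tune per database


def next_step(active_connections: int, limit: int = ACTIVE_CONNECTION_LIMIT) -> str:
    """Turn a measured value into an unambiguous runbook instruction."""
    if active_connections > limit:
        return "Proceed to Step 3"
    return "Connection count is normal; continue with the next check"


print(next_step(120))
print(next_step(42))
```

The point is that the engineer never has to decide what "too many" means at 3 AM; the runbook has already decided.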

Sequenced by Probability

List the most likely cause first. Don't make the engineer work through 10 rare scenarios before reaching the one that happens 80% of the time.
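One way to find that ordering is to count root causes from past incidents. The sketch below assumes you keep a simple list of root-cause labels, one per incident; the labels and the helper name `order_causes` are made up for illustration.

```python
from collections import Counter


def order_causes(past_incident_causes: list[str]) -> list[tuple[str, float]]:
    """Return (cause, share of incidents) pairs, most frequent first."""
    counts = Counter(past_incident_causes)
    total = sum(counts.values())
    return [(cause, count / total) for cause, count in counts.most_common()]


# Hypothetical history: one label per past incident for this alert.
history = ["connection pool exhausted"] * 8 + ["disk full"] * 2
for cause, share in order_causes(history):
    print(f"{cause}: {share:.0%} of cases")
```

The resulting shares can feed the percentages shown next to each cause in the runbook.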

Copy-Pasteable Commands

Every diagnostic or fix command should be ready to copy and paste. No one should be constructing complex commands at 3 AM.

Includes Escalation Criteria

Define when to stop following the runbook and escalate to someone else.
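Escalation criteria work best when they are mechanical rather than a judgment call. A minimal sketch, assuming two criteria: a time budget and the number of runbook causes not yet tried. The names `EscalationRule` and `should_escalate` and the 30-minute value are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationRule:
    max_minutes: int  # time budget before handing off; chosen per alert


def should_escalate(rule: EscalationRule, elapsed_minutes: float, untried_causes: int) -> bool:
    """Return True when the engineer should stop and hand off."""
    out_of_time = elapsed_minutes >= rule.max_minutes
    out_of_options = untried_causes == 0
    return out_of_time or out_of_options


rule = EscalationRule(max_minutes=30)
print(should_escalate(rule, elapsed_minutes=12, untried_causes=2))
print(should_escalate(rule, elapsed_minutes=31, untried_causes=2))
```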

Runbook Template

Title: [Alert Name] Runbook

Last Updated: [Date]
Author: [Name]
Alert: [What triggers this runbook]
Severity: [P1/P2/P3]

Quick Assessment (2 minutes)

Check the monitoring dashboard: [link]
Verify the alert is real (not a false positive)
Check if there was a recent deployment: [link to deploy log]
Determine severity and communicate in #incidents

Most Likely Cause: [Cause A] (70% of cases)

Symptoms: [What you'll see]
Diagnosis: [Command to confirm]
Fix: [Exact steps to resolve]
Verify: [How to confirm it's fixed]

Second Most Likely: [Cause B] (20% of cases)

Symptoms: [What you'll see]
Diagnosis: [Command to confirm]
Fix: [Exact steps to resolve]
Verify: [How to confirm it's fixed]

Less Common: [Cause C] (10% of cases)

Symptoms: [What you'll see]
Diagnosis: [Command to confirm]
Fix: [Exact steps to resolve]
Verify: [How to confirm it's fixed]

Escalation

If none of the above resolves the issue within [X minutes], escalate:

Primary: [Name, phone number]
Secondary: [Name, phone number]
Management: [Name, phone number]

Post-Resolution

Verify monitoring shows recovery
Update status page
Post summary in #incidents
Schedule post-mortem if P1/P2
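Each cause in the template carries four fields: Symptoms, Diagnosis, Fix, and Verify. If runbooks are kept as structured data, a small check can flag causes that were left half-written. This is a sketch under that assumption; the dictionary layout and the helper `missing_fields` are hypothetical, not part of the template.

```python
REQUIRED_FIELDS = ("symptoms", "diagnosis", "fix", "verify")


def missing_fields(cause: dict[str, str]) -> list[str]:
    """List the required fields that are absent or blank for one cause."""
    return [field for field in REQUIRED_FIELDS if not cause.get(field, "").strip()]


# Hypothetical cause entry with the Verify step left empty.
cause_a = {
    "symptoms": "Connection errors in application logs",
    "diagnosis": "SELECT count(*) FROM pg_stat_activity WHERE state = 'active'",
    "fix": "Restart the connection pooler",
    "verify": "",
}
print(missing_fields(cause_a))
```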

Keeping Runbooks Alive

After Every Incident

Did the runbook help? What was missing?
Update the runbook with any new information
Add new causes discovered during the incident

Monthly Review

Are all runbooks still accurate?
Have services or commands changed?
Are escalation contacts current?
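The monthly review is easier to keep up when something lists the runbooks that are overdue. A minimal sketch, assuming each runbook's "Last Updated" date is available as a mapping from name to date; the runbook names and the 30-day window are illustrative.

```python
from datetime import date, timedelta


def stale_runbooks(last_updated: dict[str, date], today: date, max_age_days: int = 30) -> list[str]:
    """Return the names of runbooks not updated within max_age_days, sorted."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, updated in last_updated.items() if updated < cutoff)


# Hypothetical inventory of runbooks and their last review dates.
inventory = {
    "High DB Connections": date(2024, 5, 20),
    "Disk Usage Critical": date(2024, 1, 3),
}
print(stale_runbooks(inventory, today=date(2024, 6, 1)))
```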

Make Them Findable

Link runbooks directly from alert messages
Pin them in relevant Slack channels
Store them in a searchable, always-accessible location (not behind a VPN, if possible)
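Linking from the alert itself can be as simple as a lookup when the message is built. The sketch below assumes a mapping from alert name to runbook URL; the function `format_alert`, the alert names, and the example.com URL are placeholders, not a real alerting API.

```python
def format_alert(alert_name: str, summary: str, runbook_urls: dict[str, str]) -> str:
    """Build an alert message that always carries a runbook line."""
    url = runbook_urls.get(alert_name)
    if url:
        runbook_line = f"Runbook: {url}"
    else:
        # Make the gap visible so it gets fixed after the incident.
        runbook_line = "Runbook: MISSING"
    return f"[{alert_name}] {summary}\n{runbook_line}"


# Hypothetical mapping from alert name to runbook location.
urls = {"HighDbConnections": "https://runbooks.example.com/high-db-connections"}
print(format_alert("HighDbConnections", "Active connections above limit", urls))
```

Printing "MISSING" rather than omitting the line turns every unlinked alert into a visible reminder to write the runbook.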

The best runbook is one you hope to never need but are grateful to have at 3 AM.
