How to Build an Effective On-Call Runbook

A good runbook turns a panicked 3 AM incident into a calm, step-by-step resolution. Here's how to write runbooks your team will actually use.

UptimeGuard Team
February 8, 2026 | 8 min read | 5,350 views
Tags: runbook, on-call, incident-response, documentation, sre

At 3 AM, your on-call engineer gets paged. They're groggy, stressed, and working with half their usual cognitive capacity. This is not the time for problem-solving from scratch.

A runbook gives them a clear, step-by-step path from alert to resolution.

What Makes a Good Runbook

Specific, Not General

Bad: "Check the database"
Good: "Run SELECT count(*) FROM pg_stat_activity WHERE state = 'active'; if count > 90, proceed to Step 3"
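To illustrate how unambiguous a specific step is, the "Good" instruction can be written as a tiny decision function. This is only a sketch: the name `next_step` and the fallback message are hypothetical, the threshold of 90 comes from the example above, and the count is assumed to have been read from the query output already.

```python
def next_step(active_connections: int, threshold: int = 90) -> str:
    """Turn the pg_stat_activity count into the runbook's next action."""
    if active_connections > threshold:
        return "Proceed to Step 3"
    # Hypothetical fallback wording; a real runbook would name the next check.
    return "Continue with the next check"


print(next_step(120))
```

If a step cannot be reduced to a rule this simple, it is probably still too general for 3 AM.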

Sequenced by Probability

List the most likely cause first. Don't make the engineer work through 10 rare scenarios before reaching the one that happens 80% of the time.
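One way to decide the order is to count root causes from past incidents rather than guess. The sketch below assumes you can export a list of root-cause labels from your incident tracker; the function name `order_causes` and the sample labels are hypothetical.

```python
from collections import Counter


def order_causes(past_causes: list[str]) -> list[tuple[str, float]]:
    """Return (cause, share of incidents) pairs, most frequent first."""
    counts = Counter(past_causes)
    total = sum(counts.values())
    # most_common() sorts by count, descending.
    return [(cause, count / total) for cause, count in counts.most_common()]


# Hypothetical incident history for one alert.
history = ["pool exhausted"] * 7 + ["disk full"] * 2 + ["bad deploy"]
for cause, share in order_causes(history):
    print(f"{cause}: {share:.0%}")
```

The resulting order is the order in which the causes should appear in the runbook.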

Copy-Pasteable Commands

Every diagnostic or fix command should be ready to copy and paste. No one should be constructing complex commands at 3 AM.

Includes Escalation Criteria

Define when to stop following the runbook and escalate to someone else.
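Escalation criteria work best when they are mechanical rather than a judgment call. As a minimal sketch, assume the runbook sets a time limit and that running out of listed causes also triggers escalation; the second condition and the name `should_escalate` are assumptions for illustration.

```python
from datetime import datetime, timedelta


def should_escalate(started_at: datetime, now: datetime,
                    limit_minutes: int, steps_exhausted: bool) -> bool:
    """Escalate once the time limit has passed or no runbook steps remain."""
    out_of_time = (now - started_at) >= timedelta(minutes=limit_minutes)
    return steps_exhausted or out_of_time


start = datetime(2026, 2, 8, 3, 0)
print(should_escalate(start, datetime(2026, 2, 8, 3, 35), 30, False))
```

Writing the rule down this explicitly removes the temptation to keep debugging alone for "just five more minutes".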

Runbook Template

Title: [Alert Name] Runbook

Last Updated: [Date]
Author: [Name]
Alert: [What triggers this runbook]
Severity: [P1/P2/P3]

Quick Assessment (2 minutes)

  1. Check the monitoring dashboard: [link]
  2. Verify the alert is real (not a false positive)
  3. Check if there was a recent deployment: [link to deploy log]
  4. Determine severity and communicate in #incidents

Most Likely Cause: [Cause A] (70% of cases)

Symptoms: [What you'll see]
Diagnosis: [Command to confirm]
Fix: [Exact steps to resolve]
Verify: [How to confirm it's fixed]

Second Most Likely: [Cause B] (20% of cases)

Symptoms: [What you'll see]
Diagnosis: [Command to confirm]
Fix: [Exact steps to resolve]
Verify: [How to confirm it's fixed]

Less Common: [Cause C] (10% of cases)

Symptoms: [What you'll see]
Diagnosis: [Command to confirm]
Fix: [Exact steps to resolve]
Verify: [How to confirm it's fixed]

Escalation

If none of the above resolves the issue within [X minutes], escalate:

  • Primary: [Name, phone number]
  • Secondary: [Name, phone number]
  • Management: [Name, phone number]

Post-Resolution

  1. Verify monitoring shows recovery
  2. Update status page
  3. Post summary in #incidents
  4. Schedule post-mortem if P1/P2

Keeping Runbooks Alive

After Every Incident

  • Did the runbook help? What was missing?
  • Update the runbook with any new information
  • Add new causes discovered during the incident

Monthly Review

  • Are all runbooks still accurate?
  • Have services or commands changed?
  • Are escalation contacts current?
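The monthly review is easier to keep up if stale runbooks are flagged automatically. The sketch below assumes each runbook's "Last Updated" date is available as a mapping from runbook name to date; the function name `stale_runbooks`, the 30-day default, and the sample names are hypothetical.

```python
from datetime import date, timedelta


def stale_runbooks(last_updated: dict[str, date], today: date,
                   max_age_days: int = 30) -> list[str]:
    """Return the names of runbooks not updated within max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, updated in last_updated.items()
                  if updated < cutoff)


# Hypothetical runbook inventory.
inventory = {
    "db-connections": date(2026, 1, 1),
    "api-latency": date(2026, 2, 1),
}
print(stale_runbooks(inventory, today=date(2026, 2, 8)))
```

A flagged runbook is not necessarily wrong; it only means nobody has confirmed it recently.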

Make Them Findable

  • Link runbooks directly from alert messages
  • Pin them in relevant Slack channels
  • Store in a searchable, always-accessible location (not behind VPN if possible)
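Linking the runbook from the alert itself can be as simple as adding one line to the message template. A minimal sketch, assuming your alerting tool lets you customize the message body; `format_alert` and the example URL are hypothetical and not tied to any particular product.

```python
def format_alert(alert_name: str, severity: str, runbook_url: str) -> str:
    """Build an alert message that puts the runbook link next to the alert."""
    return f"[{severity}] {alert_name}\nRunbook: {runbook_url}"


print(format_alert(
    "Database connections high",
    "P2",
    "https://runbooks.example.com/db-connections",  # placeholder URL
))
```

With the link in the page itself, the on-call engineer never has to search for the right document.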

The best runbook is one you hope to never need but are grateful to have at 3 AM.
