Automating Incident Response: When Machines Should Fix Things
Some incidents have known fixes that don't need a human at 3 AM. Auto-restart, auto-scale, auto-failover — here's how to automate the routine so humans handle the novel.
It's 3 AM. Your monitoring detects that your web server's memory usage hit 95%. An alert fires. The on-call engineer wakes up, logs in, checks the dashboard, and runs the fix: restart the service.
Total human time: 15 minutes. The fix itself: 10 seconds.
Why did we wake someone up for a 10-second fix?
The Case for Automation
Many incidents follow predictable patterns with known solutions:
- Memory spike → Restart the service
- Connection pool exhausted → Restart the connection pool
- Disk usage high → Clean up temp files and old logs
- SSL certificate expiring → Trigger auto-renewal
- Single server unresponsive → Remove from load balancer, replace
These are perfect candidates for automation. The fix is well-understood, low-risk, and doesn't require human judgment.
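These pattern-to-fix pairs can be encoded as a simple runbook lookup, so the automation only ever acts on symptoms with a known remedy. A minimal sketch (the symptom names and fix identifiers are illustrative, not from any real tool):

```python
# Minimal runbook: maps a detected symptom to a known, low-risk fix.
# Keys and values are hypothetical labels for this article's examples.
RUNBOOK = {
    "memory_spike": "restart_service",
    "pool_exhausted": "restart_connection_pool",
    "disk_high": "clean_temp_and_logs",
    "cert_expiring": "trigger_renewal",
    "server_unresponsive": "replace_behind_lb",
}

def lookup_fix(symptom):
    """Return the known fix for a symptom, or None if it needs a human."""
    return RUNBOOK.get(symptom)
```

Anything that misses the table falls through to a human, which is exactly the behavior you want for novel failures.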
What to Automate (And What Not To)
Safe to Automate
- Service restarts for known memory leak patterns
- Auto-scaling based on traffic or resource thresholds
- Failover to standby systems when primary fails health checks
- Cache clearing when cache corruption is detected
- Log rotation and cleanup for disk space management
- Certificate renewal with auto-verification
- DNS failover when primary endpoint is unreachable
Keep Human-In-The-Loop
- Database operations — Restarts, failovers, and data modifications
- Data-loss scenarios — Anything that might lose or corrupt data
- Novel failures — Issues that haven't been seen before
- Customer-facing changes — Status page updates, customer notifications
- Security incidents — Require human judgment and investigation
- Multi-system failures — Complex cascading issues need human analysis
Building Automated Remediation
The Pattern
- Detect — Monitoring identifies the issue
- Classify — Is this a known, automatable issue?
- Remediate — Execute the predefined fix
- Verify — Confirm the fix worked
- Notify — Tell the team what happened (don't wake them up)
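The five steps above can be sketched as a single handler. Everything here is hypothetical glue (the `issue` dict shape, the callback names), not a real API:

```python
def handle_incident(issue, runbook, notify, escalate):
    """Detect -> Classify -> Remediate -> Verify -> Notify, with escalation."""
    fix = runbook.get(issue["type"])           # Classify: known issue?
    if fix is None:
        escalate(issue)                        # Unknown issue: wake a human
        return "escalated"
    fix["remediate"](issue)                    # Remediate: run the known fix
    if fix["verify"](issue):                   # Verify: did it actually work?
        notify(f"Auto-fixed {issue['type']}")  # Notify the team, don't page
        return "resolved"
    escalate(issue)                            # Fix failed: treat as novel
    return "escalated"
```

Note that escalation happens in two places: when the issue isn't in the runbook, and when the known fix fails verification. Both are signals that a human needs to look.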
Example: Auto-Restart on Memory Spike
- Monitor detects memory > 90% for 5 minutes
- Automation triggers graceful service restart
- Monitor verifies service is healthy after restart
- If healthy: Log the event, send a Slack message
- If still unhealthy: Escalate to human (this is a novel failure)
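The "memory > 90% for 5 minutes" condition needs a little state so that a brief blip doesn't trigger a restart; the breach must be sustained for the whole window. A sketch, with hypothetical class and parameter names:

```python
import time

class MemorySpikeDetector:
    """Fires only when memory stays above threshold for the full window."""

    def __init__(self, threshold=0.90, window_secs=300):
        self.threshold = threshold
        self.window_secs = window_secs
        self.breach_started = None  # when the current sustained breach began

    def observe(self, mem_fraction, now=None):
        """Record one sample; return True if remediation should trigger."""
        now = time.time() if now is None else now
        if mem_fraction <= self.threshold:
            self.breach_started = None          # dip below resets the clock
            return False
        if self.breach_started is None:
            self.breach_started = now           # breach just began
            return False
        return now - self.breach_started >= self.window_secs
```

The same shape works for any threshold-for-duration alert: disk usage, connection pool saturation, error rate.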
Safety Guardrails
Automation without guardrails is dangerous. Always include:
- Rate limiting — Don't restart a service more than 3 times per hour
- Cooldown periods — Wait 5 minutes between automated actions
- Escalation triggers — If automation fails to fix the issue, alert a human
- Audit logging — Record every automated action for review
- Kill switch — Ability to disable automation instantly
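Three of these guardrails (rate limit, cooldown, kill switch) fit naturally into one gate that every automated action must pass through, with the action history doubling as an audit log. A minimal sketch under those assumptions:

```python
import time

class Guardrails:
    """Rate limit, cooldown, and kill switch for automated actions."""

    def __init__(self, max_per_hour=3, cooldown_secs=300):
        self.max_per_hour = max_per_hour
        self.cooldown_secs = cooldown_secs
        self.enabled = True   # kill switch: flip to False to halt automation
        self.history = []     # timestamps of past actions (audit trail)

    def allow(self, now=None):
        """Return True and record the action, or False to block it."""
        now = time.time() if now is None else now
        if not self.enabled:
            return False
        recent = [t for t in self.history if now - t < 3600]
        if len(recent) >= self.max_per_hour:
            return False      # rate limit hit: escalate to a human instead
        if recent and now - recent[-1] < self.cooldown_secs:
            return False      # still in cooldown from the last action
        self.history.append(now)
        return True
```

When `allow()` returns False, the right move is usually to escalate, not to silently drop the incident.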
The Progression
Most teams should automate incrementally:
Level 1: Automated Detection + Human Response
Monitoring detects issues and alerts the right person. This is where most teams start.
Level 2: Automated Detection + Suggested Action
Alerts include the likely fix: "Memory spike detected. Suggested action: restart web-service-1. Run: kubectl rollout restart deployment/web"
Level 3: Automated Detection + Automated Response + Human Verification
The system fixes known issues automatically and notifies humans after the fact.
Level 4: Full Automation with Escalation
Known issues are handled entirely by automation. Humans are only involved for novel failures.
Measuring Automation Effectiveness
- Automated resolution rate — What percentage of incidents are resolved without human intervention?
- False automation rate — How often does automation take action unnecessarily?
- Automation failure rate — How often does automated remediation fail?
- Time savings — Human hours saved per month
- MTTR improvement — How much faster are automated resolutions?
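Three of these metrics fall directly out of per-incident records, if each record notes whether automation was attempted, whether it resolved the incident, and whether the action was unnecessary. The record schema below is hypothetical, purely for illustration:

```python
def automation_metrics(incidents):
    """Compute rates from incident records. Each record is a dict with
    'resolved_by' ('automation' or 'human'), 'automation_attempted' (bool),
    and 'unnecessary' (bool) -- an illustrative schema, not a real one."""
    total = len(incidents)
    auto_resolved = sum(1 for i in incidents if i["resolved_by"] == "automation")
    attempted = [i for i in incidents if i["automation_attempted"]]
    failed = sum(1 for i in attempted if i["resolved_by"] != "automation")
    unnecessary = sum(1 for i in attempted if i["unnecessary"])
    return {
        "automated_resolution_rate": auto_resolved / total if total else 0.0,
        "automation_failure_rate": failed / len(attempted) if attempted else 0.0,
        "false_automation_rate": unnecessary / len(attempted) if attempted else 0.0,
    }
```

Time savings and MTTR improvement need timestamps as well, but the same aggregate-over-records approach applies.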
The goal isn't to eliminate humans from incident response. It's to free them from the routine so they can focus on the complex, novel, and interesting problems.
Written by
UptimeGuard Team