---
name: runbooks-incident-response
description: Use when creating incident response procedures and on-call playbooks. Covers incident management, communication protocols, and post-mortem documentation.
allowed-tools:
---
# Incident Response Runbooks

Creating effective incident response procedures for handling production incidents and on-call scenarios.
## Severity Levels

- **SEV-1 (Critical):** Immediate response
- **SEV-2 (High):** Respond within 15 minutes
- **SEV-3 (Medium):** Respond within 1 hour
- **SEV-4 (Low):** Respond next day
# Incident Response: [Alert/Issue Name]
**Severity:** SEV-1/SEV-2/SEV-3/SEV-4
**Response Time:** Immediate / 15 min / 1 hour / Next day
**Owner:** On-call Engineer
## Incident Detection
**This runbook is triggered by:**
- PagerDuty alert: `api_error_rate_high`
- Customer report in #support
- Monitoring dashboard showing anomaly
## Initial Response (First 5 Minutes)
### 1. Acknowledge & Assess
```bash
# Check current status
curl https://api.example.com/health
kubectl get pods -n production
```

### 2. Determine Severity

- **SEV-1:** Create an incident channel immediately: `/incident create SEV-1 API Outage`
- **SEV-2:** Create an incident channel; post updates every 30 minutes
- **SEV-3:** No incident channel needed; post updates hourly
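To ground the severity call in data rather than gut feel, a quick error-count check can help; a minimal sketch, where the deployment name and access-log format are assumptions:

```bash
# Count 5xx responses logged in the last 5 minutes (log format is an assumption)
kubectl logs deployment/api-server --since=5m | grep -c ' 5[0-9][0-9] '
```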
### 3. Create Incident Doc

Create an incident doc (copy this template):

```markdown
# Incident: API Outage
Started: 2025-01-15 14:30 UTC
Severity: SEV-1

## Timeline
- 14:30 - Alert fired
- 14:31 - On-call acknowledged
- 14:32 - Assessed as SEV-1
- 14:33 - Created incident channel
```
## Mitigation

**Goal:** Stop the bleeding, restore service.

### Option A: Rollback Recent Deploy

```bash
# Check recent deploys
kubectl rollout history deployment/api-server

# Roll back if deployed < 30 min ago
kubectl rollout undo deployment/api-server
```

**When to use:** Deploy coincides with incident start.
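After undoing the deploy, it helps to confirm the rollback actually completed before posting an update; a minimal sketch using the same deployment name as above:

```bash
# Wait for the rollback to finish rolling out; exits non-zero if it stalls
kubectl rollout status deployment/api-server --timeout=120s
```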
### Option B: Scale Up

```bash
# Increase replicas
kubectl scale deployment/api-server --replicas=20
```

**When to use:** High traffic, resource exhaustion.
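If traffic spikes recur, an autoscaler can take over this manual step; a sketch where the replica bounds and CPU target are assumptions to tune for your cluster:

```bash
# Scale between 5 and 30 replicas, targeting 70% average CPU utilization
kubectl autoscale deployment/api-server --min=5 --max=30 --cpu-percent=70
```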
### Option C: Restart Services

```bash
# Restart pods
kubectl rollout restart deployment/api-server
```

**When to use:** Memory leak, connection pool issues.
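Before restarting, a quick look at per-pod memory can confirm the leak hypothesis; a sketch assuming metrics-server is running in the cluster:

```bash
# Show per-pod resource usage, highest memory consumers first
kubectl top pods -n production --sort-by=memory
```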
### Option D: Enable Circuit Breaker

```bash
# Disable failing external service calls
kubectl set env deployment/api-server FEATURE_EXTERNAL_API=false
```

**When to use:** Third-party service degraded.
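Setting the variable triggers a rolling restart; a small follow-up sketch (flag name taken from the example above) confirms the flag landed on the deployment:

```bash
# List environment variables currently set on the deployment
kubectl set env deployment/api-server --list | grep FEATURE_EXTERNAL_API
```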
## Communication

Post updates to the incident channel on a fixed cadence:

- **SEV-1:** Every 10 minutes
- **SEV-2:** Every 30 minutes
- **SEV-3:** Hourly

**Update template:**
**[14:45] UPDATE**
**Status:** Investigating
**Impact:** API returning 503 errors. ~75% of requests failing.
**Actions Taken:**
- Rolled back deploy from 14:25
- Increased pod replicas to 15
**Next Steps:**
- Monitoring rollback impact
- Investigating database connection issues
**ETA:** Unknown
**Customer Impact:** Users cannot place orders.
**Workaround:** None available.
## Status Messages
**Investigating:**
> We are aware of elevated error rates on the API.
> Investigating the root cause. Updates every 10 minutes.
**Identified:**
> Root cause identified: database connection pool exhausted.
> Implementing fix now.
**Monitoring:**
> Fix deployed. Error rate dropping.
> Monitoring for 30 minutes before declaring resolved.
**Resolved:**
> Incident resolved. Error rate back to baseline.
> Post-mortem to follow.
## Evidence Collection

While service is recovering, capture evidence for the root cause investigation:

```bash
# Capture logs before they rotate
kubectl logs deployment/api-server > incident-logs.txt

# Snapshot metrics
curl -H "Authorization: Bearer $DD_API_KEY" \
  "https://api.datadoghq.com/api/v1/graph/snapshot?..." > metrics.png

# Database state
psql -c "SELECT * FROM pg_stat_activity" > db-state.txt
```
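The raw `pg_stat_activity` dump can be large; a hedged follow-up that summarizes connections by state makes pool exhaustion easier to spot (the output file name is arbitrary):

```bash
# Count connections by state to see how close the pool is to its limit
psql -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY count(*) DESC;" > db-connections.txt
```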
## Timeline
| Time | Event | Evidence |
|------|-------|----------|
| 14:20 | Deploy started | GitHub Actions log |
| 14:25 | Deploy completed | ArgoCD |
| 14:30 | Error rate spike | Datadog alert |
| 14:32 | Database connections maxed | CloudWatch |
| 14:35 | Rollback initiated | kubectl history |
| 14:38 | Service recovered | Datadog metrics |
## Root Cause
**Immediate Cause:**
Deploy introduced N+1 query pattern in user endpoint.
**Contributing Factors:**
- Missing database index on users.created_at
- No query performance testing in CI
- Database connection pool too small for traffic spike
**Why It Wasn't Caught:**
- Staging has 10x less traffic than production
- Load testing doesn't cover this endpoint
- No alerting on query performance
## Declaring Resolution

Criteria (ALL must be met):

- [ ] Fix or rollback deployed
- [ ] Error rate back to baseline
- [ ] Monitored stable for at least 30 minutes
## Immediate (Within 1 hour)
- [ ] Post resolution update to #incidents
- [ ] Update status page to "operational"
- [ ] Thank responders
- [ ] Close PagerDuty incident
## Short-term (Within 24 hours)
- [ ] Create post-mortem ticket
- [ ] Schedule post-mortem meeting
- [ ] Extract action items
- [ ] Update runbook with learnings
## Long-term (Within 1 week)
- [ ] Complete action items from post-mortem
- [ ] Add monitoring/alerting to prevent recurrence
- [ ] Document in incident database
# Post-Mortem: API Outage - 2025-01-15
**Date:** 2025-01-15
**Duration:** 14:30 UTC - 14:45 UTC (15 minutes)
**Severity:** SEV-1
**Impact:** 75% of API requests failing
**Authors:** On-call engineer, Team lead
## Summary
On January 15th at 14:30 UTC, our API experienced a severe outage, with approximately
75% of requests failing. The incident lasted 15 minutes and was caused by database
connection pool exhaustion triggered by an N+1 query in a recent deploy.
## Impact
**Customer Impact:**
- ~1,500 users unable to complete purchases
- Estimated revenue loss: $50,000
- 47 support tickets filed
**Internal Impact:**
- 3 engineers pulled from other work
- 15 minutes of severely degraded service
- Engineering manager paged
## Timeline (All times UTC)
**14:20** - Deploy #1234 merged and started deployment
**14:25** - Deploy completed, new code serving traffic
**14:30** - Alert fired: `api_error_rate_high`
**14:31** - On-call engineer acknowledged
**14:32** - Assessed as SEV-1, created incident channel
**14:33** - Identified database connection pool exhausted
**14:35** - Initiated rollback to previous version
**14:38** - Rollback complete, error rate dropping
**14:40** - Service stabilized, monitoring
**14:45** - Declared resolved
## Root Cause
The deploy introduced an N+1 query in the `/users/recent` endpoint. For each of the
20 users returned per request, the code made an additional database query to fetch the
profile picture URL. With 50 concurrent requests, this resulted in 50 × 20 = 1,000
database queries, exhausting the connection pool (configured for 100 connections).
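During the investigation, the slow base query can be confirmed directly; a sketch where the exact query shape (ordering by `created_at`, page size of 20) is assumed from the endpoint behavior described above:

```bash
# Show the query plan; without an index on created_at this reports a sequential scan
psql -c "EXPLAIN ANALYZE SELECT * FROM users ORDER BY created_at DESC LIMIT 20;"
```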
**Code change:**
```diff
- user.profile_picture_url # Preloaded in query
+ user.get_profile_picture() # Additional query per user
```

A contributing factor: `users.created_at` was not indexed, making the base query slow.

## Action Items

| Action | Owner | Deadline | Priority |
|---|---|---|---|
| Add database index on users.created_at | Alice | 2025-01-16 | P0 |
| Increase connection pool to 200 | Bob | 2025-01-16 | P0 |
| Add query performance test to CI | Charlie | 2025-01-20 | P1 |
| Implement automatic rollback on error spike | Dave | 2025-01-30 | P1 |
| Create ORM query linter to detect N+1 | Eve | 2025-02-15 | P2 |
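A sketch of the first P0 item, assuming PostgreSQL (the index name is illustrative); `CONCURRENTLY` avoids blocking writes on the busy `users` table:

```bash
# Create the missing index without blocking concurrent writes
psql -c "CREATE INDEX CONCURRENTLY idx_users_created_at ON users (created_at);"
```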
## On-Call Playbook
```markdown
# On-Call Playbook
## Before Your On-Call Shift
**1 week before:**
- [ ] Review recent incidents
- [ ] Update on-call runbooks if needed
- [ ] Test PagerDuty notifications
**1 day before:**
- [ ] Verify laptop ready (charged, VPN working)
- [ ] Test access to all systems
- [ ] Review current system status
- [ ] Check calendar for conflicting events
## During Your Shift
### When You Get Paged
**Within 1 minute:**
1. Acknowledge alert in PagerDuty
2. Check alert details for severity
3. Open relevant runbook
**Within 5 minutes:**
4. Assess severity (is it really SEV-1?)
5. Create incident channel if SEV-1/SEV-2
6. Post initial status update
### Escalation Decision Tree
Get paged
  |
Can I handle this alone?
  |
  +-- Yes --> Work it --> Fixed?
  |             |
  |             +-- Yes --> Close
  |             +-- No  --> Need help --> Escalate
  |
  +-- No  --> Escalate --> Loop in team --> Work together --> Fixed?
                                              |
                                              +-- Yes --> Close
### Handoff Procedure
**End of shift checklist:**
- [ ] No active incidents
- [ ] Status doc updated
- [ ] Next on-call acknowledged handoff
- [ ] Brief next on-call on any ongoing issues
**Handoff template:**
Hey @next-oncall! Handing off on-call. Here's the status:
Active Issues: None
Watch Items:
Recent Incidents:
System Status:
Let me know if you have questions!
## After Your Shift
- [ ] Update runbooks with any new learnings
- [ ] Complete post-mortems for incidents
- [ ] File bug tickets for issues found
- [ ] Share feedback on alerting/runbooks
```

## Common Mistakes

### Reacting in a panic

```
# Bad: Reactive chaos
EVERYTHING IS DOWN! RESTART ALL THE THINGS!

# Good: Calm assessment
Service is degraded. Let me check:
1. What's the actual impact?
2. When did it start?
3. What's the quickest safe mitigation?
```

### Fixing silently

```
# Bad: Silent fixing
*Fixes issue without telling anyone*
*Marks incident as resolved*

# Good: Regular updates
[14:30] Investigating API errors
[14:40] Root cause identified, deploying fix
[14:45] Fix deployed, monitoring
[15:00] Service stable, incident resolved
```

### Skipping the follow-up

```
# Bad: Move on quickly
Fixed it! Moving on to next task.

# Good: Learn from incidents
- Document what happened
- Identify action items
- Prevent recurrence
- Share learnings with team
```