Use when responding to production incidents following SRE principles and best practices.
This skill cannot use any tools. It operates in read-only mode without the ability to modify files or execute commands.
Managing incidents and conducting effective postmortems.
Alert fires → On-call acknowledges → Initial assessment
- Assess severity
- Page additional responders if needed
- Establish incident channel
- Assign incident commander
- Identify mitigation options
- Execute fastest safe mitigation
- Monitor for improvement
- Escalate if not improving
- Verify service health
- Communicate resolution
- Document actions taken
- Schedule postmortem
- Conduct postmortem
- Identify action items
- Track completion
- Update runbooks
INCIDENT DECLARED - P0
Service: API Gateway
Impact: All API requests failing
Started: 2024-01-15 14:23 UTC
IC: @alice
Status Channel: #incident-001
Current Status: Investigating
Next Update: 30 minutes
INCIDENT UPDATE #2 - P0
Service: API Gateway
Elapsed: 45 minutes
Progress: Identified root cause as database connection pool exhaustion.
Mitigation: Increasing pool size and restarting services.
ETA to Resolution: 15 minutes
Next Update: 15 minutes or when resolved
INCIDENT RESOLVED - P0
Service: API Gateway
Duration: 1h 12m
Impact: 100% of API requests failed
Resolution: Increased database connection pool and restarted services.
Next Steps:
- Postmortem scheduled for tomorrow 10am
- Monitoring for recurrence
- Action items being tracked in #incident-001
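The three messages above share one shape: a headline with the severity, the service, then a list of labelled fields. A minimal sketch of rendering that shape from structured data, which may help keep updates consistent under pressure. The `IncidentUpdate` class and `render` function are hypothetical names introduced here, and the output is plain text without any emoji prefix.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class IncidentUpdate:
    headline: str   # e.g. "INCIDENT DECLARED"
    severity: str   # e.g. "P0"
    service: str
    # Remaining "Label: value" lines, rendered in insertion order.
    fields: Dict[str, str] = field(default_factory=dict)


def render(update: IncidentUpdate) -> str:
    """Render an update as the plain-text block used in the templates above."""
    lines = [
        f"{update.headline} - {update.severity}",
        f"Service: {update.service}",
    ]
    lines += [f"{label}: {value}" for label, value in update.fields.items()]
    return "\n".join(lines)


declared = IncidentUpdate(
    headline="INCIDENT DECLARED",
    severity="P0",
    service="API Gateway",
    fields={
        "Impact": "All API requests failing",
        "Started": "2024-01-15 14:23 UTC",
        "IC": "@alice",
        "Status Channel": "#incident-001",
        "Current Status": "Investigating",
        "Next Update": "30 minutes",
    },
)
print(render(declared))
```

Keeping the fields in a dictionary rather than fixed attributes lets the same renderer cover the declaration, progress updates, and the resolution notice, which each carry different labels.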
# Incident Postmortem: API Outage 2024-01-15
## Summary
On January 15th, our API was completely unavailable for 72 minutes due to
database connection pool exhaustion.
## Impact
- Duration: 72 minutes (14:23 - 15:35 UTC)
- Severity: P0
- Users Affected: 100% of API users (~50,000 requests failed)
- Revenue Impact: ~$5,000 in SLA credits
## Timeline
**14:23** - Alerts fire for elevated error rate
**14:25** - IC paged, incident channel created
**14:30** - Identified all database connections exhausted
**14:45** - Decided to increase pool size
**15:00** - Configuration deployed
**15:15** - Services restarted
**15:35** - Error rate returned to normal, incident resolved
## Root Cause
Database connection pool was sized for normal load (100 connections).
Traffic spike from new feature launch (3x normal) exhausted connections.
No alerting existed for connection pool utilization.
## What Went Well
- Detection was quick (2 minutes from issue start)
- Team assembled rapidly
- Clear communication maintained
## What Didn't Go Well
- No capacity testing before feature launch
- Connection pool metrics not monitored
- No automated rollback capability
## Action Items
1. [P0] Add connection pool utilization monitoring (@bob, 1/17)
2. [P0] Implement automated rollback for deploys (@charlie, 1/20)
3. [P1] Establish capacity testing process (@diana, 1/25)
4. [P1] Increase connection pool to 300 (@bob, 1/16)
5. [P2] Update deployment runbook with load testing (@eve, 1/30)
## Lessons Learned
- Always load test before launching features
- Monitor resource utilization at all layers
- Have rollback mechanisms ready
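The durations quoted in the postmortem above can be derived from its timeline rather than typed by hand. A small sketch of that arithmetic follows; the dictionary keys and the `minutes_between` helper are illustrative names, and it assumes all timestamps fall on the same UTC day.

```python
from datetime import datetime

# Timestamps copied from the postmortem timeline; key names are hypothetical.
TIMELINE = {
    "alert_fired": "14:23",
    "ic_paged": "14:25",
    "cause_identified": "14:30",
    "fix_deployed": "15:00",
    "resolved": "15:35",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


time_to_page = minutes_between(TIMELINE["alert_fired"], TIMELINE["ic_paged"])
time_to_diagnose = minutes_between(TIMELINE["alert_fired"], TIMELINE["cause_identified"])
total_duration = minutes_between(TIMELINE["alert_fired"], TIMELINE["resolved"])

print(time_to_page, time_to_diagnose, total_duration)  # 2 7 72
```

The 72-minute total matches the duration stated in the Impact section, which is a useful consistency check before a postmortem is published.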
# Runbook: High Database Latency
## Symptoms
- Database query times > 500ms
- Elevated API latency
- Alert: DatabaseLatencyHigh
## Impact
Users experience slow page loads. P1 severity if p95 > 1s.
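The P1 rule above depends on a p95 latency figure. A minimal sketch of how that check could be computed from raw samples, using the nearest-rank percentile definition. The function names and the input format (a list of latencies in milliseconds) are assumptions; a real monitoring system would normally supply p95 directly.

```python
from typing import List, Optional


def p95(samples_ms: List[float]) -> float:
    """Nearest-rank 95th percentile of latency samples (milliseconds)."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    # ceil(0.95 * n) in integer arithmetic, as a 1-based rank
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def severity_for(samples_ms: List[float], p1_threshold_ms: float = 1000.0) -> Optional[str]:
    """Return "P1" when p95 exceeds the threshold, otherwise None."""
    return "P1" if p95(samples_ms) > p1_threshold_ms else None
```

With 100 samples, the nearest-rank p95 is the 95th smallest value, so more than five slow requests in a hundred are needed before this rule reports P1.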
## Investigation
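1. Check database metrics in Grafana:
   https://grafana.example.com/d/db-overview
2. Identify slow queries:
```sql
-- Requires the pg_stat_statements extension. On PostgreSQL 12 and
-- earlier the column is named total_time instead of total_exec_time.
SELECT * FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 10;
```
3. Check for locks:
```sql
-- Lists active sessions; wait_event_type = 'Lock' marks those waiting on a lock.
SELECT * FROM pg_stat_activity
WHERE state = 'active';
```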
Quick fixes:
Escalation: If latency > 2s for > 15 minutes, page DBA team.
## Best Practices
### Blameless Culture
- Focus on systems, not individuals
- Assume good intentions
- Learn from mistakes
- Reward transparency
### Clear Severity Definitions
- Severity should be based on user impact
- Document response time expectations
- Update definitions based on learnings
### Practice Incident Response
- Run "game days" quarterly
- Practice different scenarios
- Test on-call handoffs
- Review and improve runbooks
### Track Action Items
- Assign owners and due dates
- Review in team meetings
- Close the loop on completion
- Measure time to completion