Structured workflow for production incident management following SRE best practices. Covers incident declaration, triage, coordination, resolution, and post-mortem.
This skill inherits all available tools. When active, it can use any tool Claude has access to.
name: ops-incident-response
description: |
  Structured workflow for production incident management following SRE best practices. Covers incident declaration, triage, coordination, resolution, and post-mortem.
trigger: |
skip_when: |
This skill defines the structured process for handling production incidents. It MUST be followed for all SEV1, SEV2, and SEV3 incidents.
See shared-patterns/incident-severity.md for severity definitions.
| Phase | Focus | Owner |
|---|---|---|
| 1. Detection | Identify and confirm incident | Monitoring/On-call |
| 2. Declaration | Assess severity, declare incident | Incident Commander |
| 3. Triage | Identify impact and initial hypothesis | Response Team |
| 4. Mitigation | Restore service, implement workaround | Engineering Team |
| 5. Resolution | Permanent fix, verification | Engineering Team |
| 6. Post-Incident | RCA, action items, documentation | Incident Commander |
Trigger: Alert fires or user report received.
Owner: First responder declares incident, assigns severity.
| Criteria | SEV1 | SEV2 | SEV3 |
|---|---|---|---|
| Complete outage | X | | |
| Data loss risk | X | | |
| >50% users affected | X | | |
| <50% users affected | | X | |
| Workaround available | | | X |
See shared-patterns/incident-severity.md for complete definitions.
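As a rough illustration, the severity table above could be encoded as a small classifier. This is only a sketch: the `IncidentSignals` type and its field names are hypothetical, and it assumes the first three criteria map to SEV1, "<50% users affected" to SEV2, and "Workaround available" to SEV3. The authoritative definitions remain in shared-patterns/incident-severity.md.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentSignals:
    """Observed facts about an incident (hypothetical field names)."""
    complete_outage: bool = False
    data_loss_risk: bool = False
    users_affected_pct: float = 0.0  # percentage, 0-100
    workaround_available: bool = False


def classify_severity(signals: IncidentSignals) -> str:
    """Return 'SEV1', 'SEV2', or 'SEV3' under the assumed reading of the table."""
    # Any single SEV1 criterion is enough to declare SEV1.
    if (
        signals.complete_outage
        or signals.data_loss_risk
        or signals.users_affected_pct > 50
    ):
        return "SEV1"
    # Assumption: an available workaround puts a partial-impact incident at SEV3.
    if signals.workaround_available:
        return "SEV3"
    # Otherwise: partial impact with no workaround.
    # The table does not say how to treat exactly 50%; this sketch lands it here.
    return "SEV2"
```

For example, `classify_severity(IncidentSignals(users_affected_pct=10))` falls through to SEV2, while adding `workaround_available=True` moves it to SEV3.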
1. Create incident channel (if SEV1/SEV2): `#incident-YYYY-MM-DD-brief-description`
2. Assign Incident Commander (IC).
3. Update status page (if customer-facing).
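The channel naming pattern `#incident-YYYY-MM-DD-brief-description` is easy to get inconsistent when typed by hand under pressure. A minimal sketch of a helper that derives it from a date and a free-text title follows; the function name and the 40-character slug limit are assumptions, not part of this skill.

```python
import re
from datetime import date


def incident_channel_name(day: date, description: str, max_slug_len: int = 40) -> str:
    """Build '#incident-YYYY-MM-DD-brief-description' from a date and a title."""
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    # Truncate long titles (assumed limit) without leaving a trailing hyphen.
    slug = slug[:max_slug_len].rstrip("-")
    return f"#incident-{day.isoformat()}-{slug}"
```

With `date(2024, 3, 5)` and the title `"Checkout API 5xx spike!"`, this yields `#incident-2024-03-05-checkout-api-5xx-spike`.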
**INCIDENT DECLARED**
**Severity:** SEV[1/2/3]
**Title:** [Brief description]
**Incident Commander:** @[name]
**Channel:** #incident-[date]-[slug]
**Impact:**
- Services affected: [list]
- Users affected: [count/percentage]
- Started: [timestamp UTC]
**Current Status:**
[Brief description of current state]
**Next Update:** [timestamp]
Owner: Incident Commander coordinates, engineering investigates.
Update frequency by severity:
| Severity | Internal Update | External Update |
|---|---|---|
| SEV1 | Every 10 min | Every 15 min |
| SEV2 | Every 15 min | Every 30 min |
| SEV3 | Every 30 min | As needed |
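To fill in the "Next Update" field consistently, the cadence table above can be turned into a lookup. The sketch below assumes the interval is measured from the previous update; the names `UPDATE_INTERVALS` and `next_update_due` are illustrative only.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Minutes between updates, copied from the table above.
# None stands for "As needed" (no fixed cadence).
UPDATE_INTERVALS = {
    "SEV1": {"internal": 10, "external": 15},
    "SEV2": {"internal": 15, "external": 30},
    "SEV3": {"internal": 30, "external": None},
}


def next_update_due(severity: str, audience: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next update is due, or None if there is no fixed cadence."""
    minutes = UPDATE_INTERVALS[severity][audience]
    if minutes is None:
        return None
    return last_update + timedelta(minutes=minutes)


if __name__ == "__main__":
    last = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(next_update_due("SEV1", "internal", last).isoformat())
```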
Owner: Engineering implements fix, IC coordinates.
**MITIGATION IN PROGRESS**
**Action:** [description]
**Owner:** @[name]
**Started:** [timestamp]
**Verification:**
- [ ] [criterion 1]
- [ ] [criterion 2]
**Rollback Plan:**
[If mitigation fails, do X]
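The verification checklist in the template above can be checked mechanically before anyone moves the incident forward. The following is a sketch that assumes the message uses Markdown task-list syntax (`- [ ]` and `- [x]`); the helper name `unchecked_items` is hypothetical.

```python
import re

# Matches Markdown task-list items such as "- [ ] text" or "- [x] text".
CHECKBOX = re.compile(r"^\s*-\s*\[( |x|X)\]\s+(.*)$")


def unchecked_items(message: str) -> list[str]:
    """Return the text of every checklist item that is still unticked."""
    pending = []
    for line in message.splitlines():
        match = CHECKBOX.match(line)
        if match and match.group(1) == " ":
            pending.append(match.group(2).strip())
    return pending
```

An empty result means every listed criterion has been ticked; anything else names what is still outstanding.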
Owner: Engineering confirms fix, IC verifies resolution.
ALL must be true before marking resolved:
**INCIDENT RESOLVED**
**Duration:** [X hours Y minutes]
**Resolution Time:** [timestamp UTC]
**Root Cause:**
[Brief description of what caused the incident]
**Fix Applied:**
[What was done to resolve]
**Next Steps:**
- [ ] RCA scheduled for [date]
- [ ] Action items tracked in [location]
**Retrospective:** [date/time]
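The Duration field in the resolution template can be derived from the start and resolution timestamps rather than estimated by hand. A small sketch, assuming both timestamps are timezone-aware and that partial minutes are rounded down:

```python
from datetime import datetime, timezone


def format_duration(started: datetime, resolved: datetime) -> str:
    """Format the elapsed time as 'X hours Y minutes' (seconds are dropped)."""
    if resolved < started:
        raise ValueError("resolved must not be earlier than started")
    total_minutes = int((resolved - started).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 9, 15, tzinfo=timezone.utc)
    end = datetime(2024, 1, 1, 11, 47, tzinfo=timezone.utc)
    print(format_duration(start, end))
```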
Owner: Incident Commander schedules RCA, tracks action items.
| Severity | RCA Required | Timeline |
|---|---|---|
| SEV1 | MANDATORY | 48 hours |
| SEV2 | MANDATORY | 1 week |
| SEV3 | Optional | 2 weeks |
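The RCA table above maps directly to a deadline calculation. The sketch below assumes the timeline is counted from the resolution time, which the table does not state explicitly; `RCA_POLICY` and `rca_deadline` are illustrative names.

```python
from datetime import datetime, timedelta, timezone

# (RCA mandatory?, time allowed), copied from the table above.
RCA_POLICY = {
    "SEV1": (True, timedelta(hours=48)),
    "SEV2": (True, timedelta(weeks=1)),
    "SEV3": (False, timedelta(weeks=2)),
}


def rca_deadline(severity: str, resolved_at: datetime) -> tuple[bool, datetime]:
    """Return (is_mandatory, due_by) for the RCA of an incident."""
    mandatory, window = RCA_POLICY[severity]
    # Assumption: the clock starts when the incident is marked resolved.
    return mandatory, resolved_at + window
```

For a SEV3 incident the first element is `False`, so callers can treat the returned date as a target rather than a requirement.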
# Incident Post-Mortem: [Title]
**Incident ID:** INC-YYYY-NNNN
**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** SEV[1/2/3]
**Author:** @[incident commander]
## Summary
[2-3 sentence summary of what happened]
## Impact
- **Users Affected:** [count/percentage]
- **Revenue Impact:** [if applicable]
- **SLA Impact:** [if applicable]
## Timeline
| Time (UTC) | Event |
|------------|-------|
| HH:MM | [event] |
## Root Cause
[Technical description of the root cause]
## Contributing Factors
1. [Factor 1]
2. [Factor 2]
## What Went Well
1. [Item 1]
2. [Item 2]
## What Could Be Improved
1. [Item 1]
2. [Item 2]
## Action Items
| Item | Owner | Due Date | Status |
|------|-------|----------|--------|
| [action] | @[name] | YYYY-MM-DD | Open |
## Lessons Learned
[Key takeaways for the team]
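If events are recorded as data while the incident is in progress, the Timeline table of the post-mortem template can be generated instead of reconstructed from memory. A minimal sketch, assuming each event is a `(timestamp, text)` pair with a timezone-aware timestamp; `render_timeline` is a hypothetical helper.

```python
from datetime import datetime, timezone


def render_timeline(events: list[tuple[datetime, str]]) -> str:
    """Render events as the template's Timeline table, oldest first, in UTC."""
    rows = ["| Time (UTC) | Event |", "|------------|-------|"]
    for timestamp, text in sorted(events, key=lambda event: event[0]):
        # Convert to UTC so entries logged in different zones line up.
        utc = timestamp.astimezone(timezone.utc)
        rows.append(f"| {utc:%H:%M} | {text} |")
    return "\n".join(rows)
```

Events can be appended in any order as they are logged; sorting happens at render time.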
| Rationalization | Why It's WRONG | Required Action |
|---|---|---|
| "Document later, fix first" | Memory fades in hours | Document AS you fix |
| "Small incident, skip RCA" | Small incidents reveal systemic issues | RCA for SEV1/SEV2 minimum |
| "Root cause is obvious" | Obvious != correct | Investigate with data |
| "Skip verification period" | Premature resolution = reopen | Wait full verification period |
| User Says | Your Response |
|---|---|
| "Mark resolved now, verify later" | "Cannot mark resolved until verification complete. This prevents reopened incidents." |
| "Skip the RCA, we know what happened" | "RCA is mandatory for this severity. Schedule within required timeline." |
| "No time for documentation" | "Real-time documentation takes 30 seconds per event. Memory loss causes worse rework." |
For complex incidents, dispatch the incident-responder agent:
Task tool:
  subagent_type: "incident-responder"
  model: "opus"
  prompt: |
    INCIDENT: [description]
    SEVERITY: SEV[X]
    CURRENT STATUS: [state]
    REQUEST: [specific assistance needed]