From grimoire
Documents incidents, outages, or production failures with blameless post-mortems. Includes timeline, root cause analysis, and action items.
Slash command: /grimoire:write-post-mortem
Write a blameless post-mortem that captures what happened, why it happened, and what will prevent recurrence — without blaming individuals.
Adopted by: Amazon Web Services (internal COE process), Google SRE teams, Etsy (pioneered blameless post-mortems in 2012), PagerDuty, Atlassian, Netflix.
Impact: Google's SRE book reports that teams with structured post-mortems increase mean time between incidents (MTBI) by 20–40%. Etsy's blameless culture is credited with enabling 50+ deploys per day without an increased incident rate. The 2023 Puppet State of DevOps Report found that high-performing teams are 2.4× more likely to conduct blameless post-mortems.
Why best: Blameless post-mortems surface systemic failures rather than hiding them behind individual blame. When engineers fear punishment, they under-report near-misses and route around broken systems. The blameless model assumes engineers are competent and acted rationally given their information at the time — the fault lies in the system, not the person.
Sources: Google SRE Book (Chapter 15); John Allspaw, "Blameless Post-Mortems and a Just Culture" (Etsy Engineering Blog, 2012); Puppet State of DevOps Report 2023.
1. Open a draft within 24–48 hours of resolution while details are fresh. Assign a single author; others contribute via comments.
2. Write the incident summary (3–5 sentences): what failed, what the user-visible impact was, when it started and ended, and the severity level (P0/P1/P2 or equivalent).
3. Build the timeline in chronological order with UTC timestamps. Include: first alert fired, who was paged, each diagnostic action taken, each mitigation attempted, resolution time, and all-clear time. Be factual, with no editorializing.
4. State the root cause in one sentence using the "five whys" technique: ask "why did this happen?" iteratively until you reach a systemic cause, not a human action. Example: not "an engineer deleted the table" but "a migration script had no dry-run mode and no confirmation prompt in production."
5. List contributing factors: conditions that allowed the root cause to manifest. Examples: missing monitoring, inadequate runbooks, insufficient test coverage, unclear ownership, alert fatigue.
6. Write action items. Each must be specific (not "improve monitoring"), assigned to a named owner, and given a due date. Categorize each as preventive (stops recurrence), detective (catches it sooner), or corrective (reduces blast radius). Aim for 3–7 actionable items, not 20 aspirational ones.
7. State what went well: tools that worked, responders who acted effectively, communication that helped. This reinforces good practices and is not sycophancy.
8. Publish to a shared, searchable incident log (Confluence, Notion, internal wiki). Notify stakeholders. Schedule a 30-minute review meeting if the incident was P0/P1.
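The timeline step above lends itself to a small helper. The following is a minimal Python sketch, not part of the grimoire skill itself: `TimelineEntry` and `render_timeline` are hypothetical names, and the sample events are invented for illustration. It rejects timestamps that carry no timezone, converts everything to UTC, and sorts the entries chronologically.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime  # must be timezone-aware
    event: str    # factual description of what happened


def render_timeline(entries):
    """Return timeline lines in chronological order with UTC timestamps."""
    for entry in entries:
        if entry.at.tzinfo is None:
            # A naive timestamp is ambiguous, so refuse it rather than guess.
            raise ValueError(f"naive timestamp for event: {entry.event!r}")
    ordered = sorted(entries, key=lambda entry: entry.at)
    return [
        f"{entry.at.astimezone(timezone.utc):%Y-%m-%d %H:%M} UTC  {entry.event}"
        for entry in ordered
    ]


# Invented sample data: one entry recorded in UTC, one in a UTC+2 local time.
entries = [
    TimelineEntry(
        datetime(2024, 1, 10, 14, 12, tzinfo=timezone.utc),
        "Rollback completed; error rate back to baseline",
    ),
    TimelineEntry(
        datetime(2024, 1, 10, 15, 5, tzinfo=timezone(timedelta(hours=2))),
        "First alert fired: checkout error rate above threshold",
    ),
]

for line in render_timeline(entries):
    print(line)
```

Refusing naive timestamps is a deliberate choice in this sketch: responders often paste local times from chat logs, and silently assuming UTC would reorder events without anyone noticing.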
Root cause (bad): "Engineer forgot to set the timeout flag."
Root cause (good): "The deployment checklist did not include a timeout configuration step, and no automated validation checked for missing timeout settings before deployment to production."
Action item (bad): "Be more careful with production deployments."
Action item (good): "Add timeout validation to the pre-deploy CI check. Owner: @platform-team. Due: 2024-02-15."
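The difference between the bad and good action items above can be checked mechanically, at least in part. The sketch below is one possible approach under stated assumptions: `ActionItem`, `find_problems`, and the list of vague phrases are all hypothetical, and the phrase list is a crude heuristic rather than a reliable test of specificity.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Categories named in the action-item guidance above.
CATEGORIES = {"preventive", "detective", "corrective"}

# Hypothetical heuristic: phrases that tend to signal a non-specific item.
VAGUE_PHRASES = re.compile(r"\b(be more careful|improve monitoring|try to)\b", re.IGNORECASE)


@dataclass
class ActionItem:
    description: str
    owner: str
    due: Optional[date]
    category: str


def find_problems(item: ActionItem) -> List[str]:
    """Return a list of reasons the item falls short; empty if none found."""
    problems = []
    if not item.owner.strip():
        problems.append("missing owner")
    if item.due is None:
        problems.append("missing due date")
    if item.category not in CATEGORIES:
        problems.append("unknown category")
    if VAGUE_PHRASES.search(item.description):
        problems.append("description looks vague")
    return problems


bad = ActionItem("Be more careful with production deployments.", "", None, "")
good = ActionItem(
    "Add timeout validation to the pre-deploy CI check.",
    "@platform-team",
    date(2024, 2, 15),
    "preventive",
)

print(find_problems(bad))
print(find_problems(good))
```

A check like this only catches missing fields and a few stock phrases; whether an item is genuinely actionable still needs a human reviewer.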
Five-whys example:
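As an illustration only, a five-whys chain can be written down as ordered question and answer pairs. The chain below is hypothetical: it extends the migration-script example from the root cause step, and every intermediate answer is invented. `CHAIN`, `format_chain`, and `root_cause` are assumed names, not part of the skill.

```python
# Hypothetical five-whys chain; all questions and answers are illustrative.
CHAIN = [
    ("Why did users see errors?",
     "A table was missing in production."),
    ("Why was the table missing?",
     "A migration script dropped it."),
    ("Why did the script drop a production table?",
     "It was pointed at production instead of staging."),
    ("Why did nothing stop it from running against production?",
     "The script gave no warning about its target environment."),
    ("Why was there no warning?",
     "The migration script had no dry-run mode and no confirmation prompt in production."),
]


def format_chain(chain):
    """Render each question and answer as one numbered line."""
    return [
        f"Why {number}: {question} Because: {answer}"
        for number, (question, answer) in enumerate(chain, start=1)
    ]


def root_cause(chain):
    """The last answer in the chain is treated as the systemic root cause."""
    return chain[-1][1]


for line in format_chain(CHAIN):
    print(line)
print("Root cause:", root_cause(CHAIN))
```

Note that the third answer still describes an action someone took; the chain keeps going until the final answer describes a property of the tooling rather than a person's choice.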
npx claudepluginhub jeffreytse/grimoire --plugin grimoire