Root Cause Analysis
Symptom → hypothesis formation → evidence gathering → elimination → root cause → verified fix.
<when_to_use>
- Diagnosing system failures or unexpected behavior
- Investigating incidents or outages
- Finding the actual cause vs surface symptoms
- Preventing recurrence through understanding
- Any situation where "why did this happen?" needs answering
NOT for: known issues with documented fixes, simple configuration errors, guessing without evidence
</when_to_use>
<discovery_phase>
Core Questions
| Question | Why it matters |
|---|---|
| What's the symptom? | Exact manifestation of the problem |
| When did it start? | First occurrence, patterns in timing |
| Can you reproduce it? | Consistently, intermittently, specific conditions |
| What changed recently? | Deployments, config, dependencies, environment |
| What have you tried? | Previous fix attempts, their results |
| What are the constraints? | Time budget, what can't be modified, rollback options |
Confidence Thresholds
| Level | Discovery State | Action |
|---|---|---|
| ░░░░░ 0 | Symptom unclear | Keep gathering info, don't investigate yet |
| ▓░░░░ 1 | Symptom clear, can't reproduce | Focus on reproduction |
| ▓▓░░░ 2 | Can reproduce, context unclear | Gather environment/history |
| ▓▓▓░░ 3 | Good context, some gaps | Can start hypothesis phase |
| ▓▓▓▓░ 4+ | Clear picture | Proceed to investigation |
At level 3+, transition from discovery to hypothesis formation. Below level 3, keep gathering context.
</discovery_phase>
<hypothesis_formation>
Hypothesis Quality
Good hypothesis:
- Testable — can design experiment to verify
- Falsifiable — can prove it wrong
- Specific — points to concrete cause
- Plausible — consistent with evidence
Weak hypothesis:
- Too broad — "something's wrong with the system"
- Untestable — "maybe cosmic rays"
- Contradicts evidence — ignores known facts
- Assumes conclusion — "X is broken because X always breaks"
Multiple Working Hypotheses
Generate 2–4 competing theories (see the sketch after the ranking criteria below):
- List each hypothesis
- Note supporting evidence
- Note contradicting evidence
- Rank by likelihood
- Design tests to differentiate
Ranking Criteria
- Evidence support — how much data backs this?
- Parsimony — simplest explanation?
- Prior probability — how often does this kind of cause produce this symptom?
- Testability — can verify quickly?
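As an illustration, a minimal sketch in Python of tracking and ranking competing hypotheses against these criteria — the incident, hypothesis claims, priors, and scoring heuristic are all hypothetical:

```python
# Minimal sketch (hypothetical incident: API latency spike after a deploy).
# All hypotheses, evidence, priors, and the scoring heuristic are illustrative.
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    claim: str                       # specific, testable statement
    supporting: list[str] = field(default_factory=list)
    contradicting: list[str] = field(default_factory=list)
    prior: float = 0.5               # how often this kind of cause produces this symptom
    test_minutes: int = 15           # rough cost to confirm or refute

    def score(self) -> float:
        # Evidence balance plus prior, discounted by test cost (testability proxy).
        return (len(self.supporting) - len(self.contradicting) + self.prior) / self.test_minutes

hypotheses = [
    Hypothesis("DB connection pool exhausted by retry logic added in deploy d07",
               supporting=["latency rose minutes after d07", "pool metrics near limit"],
               prior=0.6, test_minutes=10),
    Hypothesis("Upstream cache eviction storm at the top of each hour",
               supporting=["some spikes look roughly hourly"],
               contradicting=["first spike was at :23"],
               prior=0.3, test_minutes=20),
]

for h in sorted(hypotheses, key=lambda h: h.score(), reverse=True):
    print(f"{h.score():.2f}  {h.claim}")
```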
</hypothesis_formation>
<evidence_gathering>
Observation Collection
Gather concrete data:
- Error manifestation — exact symptoms, messages, states
- Reproduction steps — minimal sequence triggering issue
- System state — logs, variables, config at failure time
- Environment — versions, platform, dependencies
- Timing — when it started, frequency, patterns
Symptom Classification
Distinguish:
- Primary symptom — what user/system experiences
- Secondary symptoms — cascading effects
- Red herrings — coincidental but unrelated
- Intermittent vs consistent — failure pattern
Breadcrumb Analysis
Trace backwards from symptom:
- Last known good state — what was working?
- First observable failure — when did it break?
- Changes between — what's different?
- Correlation vs causation — timing vs actual cause
- Root trigger — first thing that went wrong
</evidence_gathering>
<hypothesis_testing>
Experimental Design
For each hypothesis:
- Prediction — if true, what should we observe?
- Test method — how to check?
- Expected result — what confirms/refutes?
- Time budget — how long to spend?
- Stop condition — when to move on?
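A minimal sketch of capturing one such design, continuing the hypothetical connection-pool example above (all field values are illustrative):

```python
# Hypothetical test plan for the top-ranked hypothesis; values are illustrative.
test_plan = {
    "hypothesis": "DB connection pool exhausted by retry logic added in deploy d07",
    "prediction": "pool 'waiting' count rises above 0 during a latency spike",
    "test_method": "watch pool metrics while replaying the reproduction steps",
    "confirms": "waiting count and latency climb together",
    "refutes": "latency spikes while the pool stays well under its limit",
    "time_budget_minutes": 15,
    "stop_condition": "budget exhausted or the result is observed twice",
}
```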
Testing Strategies
Simplest first:
- Quick tests before slow tests
- Non-destructive before destructive
- Local before remote
Highest probability first:
- Most likely cause before edge cases
- Common failures before rare failures
- Recent changes before old components
Elimination approach:
- Binary-search the problem space
- Isolate variables one at a time
- Narrow scope systematically
Test Execution
- Baseline — confirm issue still present
- Single variable — change one thing
- Observe — what happened?
- Document — record result before next test
- Iterate — adjust hypothesis or try next test
</hypothesis_testing>
<elimination_methodology>
Binary Search
For large problem spaces:
- Identify range — known good state, known bad state
- Test midpoint — does issue exist here?
- Narrow range — move to half with issue
- Repeat — until single change identified
Works for: finding breaking changes, isolating components, narrowing scope
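A minimal sketch of bisecting an ordered history of changes — `changes` and `has_issue` are hypothetical stand-ins for your deploy/commit list and a reproduction check; `git bisect` automates the same idea for commits:

```python
# Minimal sketch: bisect an ordered history to find the first bad change.
# `changes` is ordered oldest -> newest; `has_issue` is a hypothetical check
# that reproduces the symptom with the system rolled to that change.
def find_first_bad(changes, has_issue):
    lo, hi = 0, len(changes) - 1          # assumes changes[0] good, changes[-1] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if has_issue(changes[mid]):
            hi = mid                      # issue already present here: look earlier
        else:
            lo = mid                      # still good here: look later
    return changes[hi]                    # first change where the issue appears

# Hypothetical usage: ten deploys, with the breakage introduced in "d07".
deploys = [f"d{n:02d}" for n in range(10)]
print(find_first_bad(deploys, lambda d: d >= "d07"))  # -> d07
```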
Variable Isolation
Test one change at a time:
- Baseline — measure with all defaults
- Change X — measure impact
- Revert X, change Y — measure impact
- Repeat for each variable
- Combinations if interactions suspected
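A minimal sketch of the baseline-then-one-change loop; `apply`, `revert`, and `measure_latency_ms` are hypothetical helpers for the system under test, and the factor names are illustrative:

```python
# Minimal sketch: vary one factor at a time against a fixed baseline.
# `apply`, `revert`, and `measure_latency_ms` are hypothetical helpers for the
# system under investigation; the factor names and values are illustrative.
def isolate(factors, apply, revert, measure_latency_ms):
    results = {"baseline": measure_latency_ms()}   # all defaults first
    for name, value in factors.items():
        apply(name, value)                         # change exactly one thing
        results[name] = measure_latency_ms()       # measure its impact
        revert(name)                               # restore defaults before the next test
    return results

factors = {"pool_size": 50, "cache_ttl_s": 0, "gzip_enabled": False}
# results = isolate(factors, apply, revert, measure_latency_ms)
# Compare each entry against results["baseline"] to see which factor moves the symptom.
```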
Process of Elimination
What it's NOT:
- ✗ Not component A (tested in isolation)
- ✗ Not component B (issue reproduced without it)
- ✗ Not external factor (issue reproduced in a clean environment)
- ✓ Must be in remaining scope
Systematically rule out possibilities until one remains.
</elimination_methodology>
<time_boxing>
Phase Durations
| Phase | Duration | Exit Condition |
|---|---|---|
| Discovery | 5–10 min | Questions answered, can reproduce |
| Hypothesis | 10–15 min | 2–4 testable theories ranked |
| Testing | 15–30 min per hypothesis | Confirmed or ruled out |
| Fix | Variable | Root cause addressed |
| Verification | 10–15 min | Fix confirmed, prevention documented |
If stuck in any phase beyond 2× the estimate → step back, seek a fresh perspective, or escalate.
</time_boxing>
<audit_trail>
Investigation Log
Log every step for replay and review:
[TIME] Checked evidence → found specific data
[TIME] Hypothesis: possible cause based on evidence
[TIME] Test: what was tried → result observed
[TIME] Hypothesis ruled out/confirmed, reason
[TIME] New hypothesis based on new evidence
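A hypothetical excerpt, continuing the connection-pool example (times and findings are invented for illustration):
[14:02] Checked p95 latency dashboard → spike began 13:47, immediately after deploy d07
[14:05] Hypothesis: connection pool exhausted by retry logic added in d07
[14:09] Test: watched pool metrics while replaying repro → waiting count climbed with latency
[14:14] Hypothesis confirmed; hourly-cache theory ruled out (spike timing doesn't match)
[14:16] Next: verify fix by raising pool size on one canary instance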
Benefits:
- Prevents revisiting same ground
- Enables handoff to others
- Creates learning artifact
- Catches circular investigation
</audit_trail>
<common_pitfalls>
Resistance Patterns
Rationalizations that derail investigation:
| Thought | Why it's wrong |
|---|---|
| "I already looked at that" | Memory unreliable; re-examine with fresh evidence |
| "That can't be the issue" | Assumptions block investigation; test anyway |
| "We need to fix this quickly" | Pressure leads to random changes, not solutions |
| "The logs don't show anything" | Absence of evidence ≠ evidence of absence |
| "It worked before" | Systems change; past behavior doesn't guarantee current |
| "Let me just try this one thing" | Random trial without hypothesis wastes time |
When you catch yourself thinking these → pause, return to methodology.
Confirmation Bias
Avoid:
- Seeing only evidence supporting pet hypothesis
- Ignoring contradictory data
- Stopping investigation once you find "a" cause
Counter:
- Actively seek disconfirming evidence
- Test alternative hypotheses
- Ask "what would prove me wrong?"
Correlation ≠ Causation
Avoid:
- "It started when X changed" → X caused it
- "Happens at specific time" → time is the cause
Counter:
- Test direct causal mechanism
- Look for confounding variables
- Verify by removing supposed cause
</common_pitfalls>
<documentation>
Root Cause Report
At conclusion:
- Summary — what was broken, what fixed it
- Root cause — ultimate source of issue
- Contributing factors — what made it worse
- Evidence — data supporting conclusion
- Prevention — how to avoid recurrence
Lessons Learned
Extract patterns:
- Early indicators — what could have caught this sooner?
- Investigation efficiency — what worked well/poorly?
- Knowledge gaps — what did we not know?
- Process improvements — how to prevent similar issues?
</documentation>
<confidence_calibration>
High confidence (▓▓▓▓▓):
- Consistent reproduction
- Clear cause → effect demonstrated
- Multiple independent confirmations
- Fix verified working
Moderate confidence (▓▓▓░░):
- Reproduces most times
- Correlation strong but not proven causal
- Single confirmation
- Fix appears to work
Low confidence (▓░░░░):
- Inconsistent reproduction
- Correlation unclear
- Unverified hypothesis
- Fix untested
</confidence_calibration>
<rules>
ALWAYS:
- Gather sufficient context before hypothesizing
- Form multiple competing hypotheses
- Test systematically, one variable at a time
- Document investigation trail
- Verify fix actually addresses root cause
- Document for future prevention
NEVER:
- Jump to solutions without diagnosis
- Trust single hypothesis without testing alternatives
- Apply fixes without understanding cause
- Skip verification of fix
- Repeat same failed investigation steps
- Hide uncertainty about root cause
</rules>
<references>
Related skills:
- [debugging-and-diagnosis](../debugging-and-diagnosis/SKILL.md) — code-specific debugging (loads this skill)
- [codebase-analysis](../codebase-analysis/SKILL.md) — uses for code investigation
- [report-findings](../report-findings/SKILL.md) — presenting investigation results
</references>