From aegis
Diagnoses bugs, test failures, and unexpected behavior by isolating root cause before proposing any fix. Applies layered investigation and triage for shared, cross-module, or contract-sensitive code.
How this skill is triggered — by the user, by Claude, or both
Slash command
/aegis:systematic-debugging
The summary Claude sees in its skill listing (used to decide when to auto-load this skill)
→ Bug? Test failure? Unexpected behavior? → **Find root cause first. No fixes without evidence.**
Random fixes waste time and create new bugs. Symptom fixes are failure.
This skill is the canonical debugging workflow. Use it to move from symptom to root cause, then to the smallest sufficient stable repair and retirement plan. Smallest repair means correct owner + bug class fixed + bounded entropy, not the smallest textual diff.
Any technical issue: test failures, bugs, unexpected behavior, performance problems, build/integration failures.
Especially under time pressure, when "just one quick fix" seems obvious, after multiple failed fixes, or when duplicate owners / fallback chains may be involved.
For low-risk, single-owner bugs, keep the report compact: Symptom,
Reproduction, Root Cause, Fix Boundary, and Verification. Still collect
root-cause evidence before editing. If fallback, duplicate owner, consumer-side
patching, contract risk, shared logic, or cross-module behavior appears,
escalate to the full workflow.
BEFORE attempting ANY fix:
Read Error Messages Carefully
Reproduce Consistently
See feedback-loop-construction.md to build an automated reproduction loop; don't guess.
Check Recent Changes
Gather Evidence in Multi-Component Systems
Trace Data Flow (when error is deep in call stack)
See root-cause-tracing.md.
Drill Upward Through Diagnostic Layers
Start at L1. Exhaust all "why" questions at each layer before moving upward. The chain is open-ended — architecture is not the endpoint.
L1 Symptom: what failed? where? exact reproduction?
L2 Logic: which branch, invariant, or state transition is wrong?
L3 System: which component boundary, dependency, or ownership seam?
L4 Architecture: what design choice, duplicated owner, or fallback chain?
L5 Cross-system: which API / SLA / timing contract between systems?
L6 Platform: what runtime / OS / framework constraint?
L7 Spec gap: who never defined correct behavior for this case?
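The upward drill can be pictured as a loop over the seven layers. The following is a minimal sketch, not part of the skill itself: the `Layer` enum, the `stop_layer` function, and the `(question, explained_here)` record shape are all hypothetical names chosen for illustration. It assumes each "why" question is marked as explained at its own layer or not, and that drilling stops at the first layer where nothing remains unexplained.

```python
from enum import IntEnum
from typing import Optional


class Layer(IntEnum):
    """The seven diagnostic layers, in drilling order (names are illustrative)."""
    SYMPTOM = 1
    LOGIC = 2
    SYSTEM = 3
    ARCHITECTURE = 4
    CROSS_SYSTEM = 5
    PLATFORM = 6
    SPEC_GAP = 7


def stop_layer(whys: dict) -> Optional[Layer]:
    """Return the first layer at which every recorded "why" is explained.

    `whys` maps a Layer to a list of (question, explained_here) tuples.
    A layer with an unexplained question forces the drill one layer up.
    Returns None when no layer closes, i.e. the chain is still open.
    """
    for layer in Layer:  # IntEnum iterates in definition order: L1 .. L7
        questions = whys.get(layer, [])
        if all(explained for _, explained in questions):
            return layer
    return None


whys = {
    Layer.SYMPTOM: [("why is the field null?", False)],
    Layer.LOGIC: [("why was the update branch skipped?", False)],
    Layer.SYSTEM: [("why do two owners write this field?", True)],
}
print(stop_layer(whys).name)  # SYSTEM
```

Returning `None` rather than defaulting to the top layer reflects the point that the chain is open-ended: running out of layers is not the same as reaching a root cause.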
Hard signal definitions (H/T/D) are in the Quality Gate — apply them there, not during initial investigation.
When the stop layer is not obvious, the user asks where the diagnosis
stops, the issue crosses component/system boundaries, or a user-provided
fact falsifies the current layer, expose a compact Layer Stop Card before
fixing:
Layer Stop Card:
- Current Stop Layer: L1 Symptom | L2 Logic | L3 System | L4 Architecture | L5 Cross-system Contract | L6 Platform | L7 Spec Gap | T-class boundary
- Checked Path:
- Evidence For Stop:
- Excluded Layers:
- Falsifier:
- User Intervention Point:
- Next Action:
The card is an advisory readback of the diagnostic stop point. It is not a
GateDecision, PolicySnapshot, or completion authority.
Patch-Shape Triage Before Editing
Treat the first obvious fix as evidence, not clearance to edit. If the candidate fix shape matches any item below, continue upward before changing code unless you can prove the local layer is the canonical owner:
try/catch, early return, or one-off branch
Required output before editing when this gate fires:
PatchShape:
CanonicalOwner:
UpwardDrillSignal:
Decision: fix owner | continue investigation | escalate
If the tempting fix is "just add a small guard/fallback", also run:
Minimality Check:
- Smallest textual diff:
- Existing owner / reuse path:
- Correct owner:
- Bug class fixed:
- New branch/fallback added:
- Existence proof for new path:
- Old path retired or scheduled:
- Verdict: sufficient repair | local patch | needs first-principles review
local patch is a mitigation, not a sufficient repair, unless it is the
canonical owner and includes a retention reason plus retirement trigger.
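The verdict rule above can be expressed as a small classifier. This is a hedged sketch: `MinimalityCheck` and `classify` are hypothetical names, and only the "sufficient repair" branch is taken from the text (correct owner, bug class fixed, and any new branch bounded by a retention reason plus retirement trigger). How the other two verdicts are split is an assumption made here for illustration.

```python
from dataclasses import dataclass


@dataclass
class MinimalityCheck:
    """A reduced form of the Minimality Check card (field names are illustrative)."""
    at_canonical_owner: bool
    bug_class_fixed: bool
    adds_branch_or_fallback: bool
    retention_reason: str = ""
    retirement_trigger: str = ""


def classify(check: MinimalityCheck) -> str:
    # A new branch/fallback is "bounded" only with both a reason to keep it
    # and a trigger for deleting it.
    bounded = bool(check.retention_reason and check.retirement_trigger)
    entropy_ok = (not check.adds_branch_or_fallback) or bounded

    if check.at_canonical_owner and check.bug_class_fixed and entropy_ok:
        return "sufficient repair"
    # ASSUMPTION: a partial hit (right owner OR bug class fixed) is treated
    # as a local patch, i.e. a mitigation rather than a repair.
    if check.at_canonical_owner or check.bug_class_fixed:
        return "local patch"
    # ASSUMPTION: neither owner nor bug class is right, so step back.
    return "needs first-principles review"


guard_in_caller = MinimalityCheck(
    at_canonical_owner=False, bug_class_fixed=True, adds_branch_or_fallback=True
)
print(classify(guard_in_caller))  # local patch
```

Note that the smallest textual diff never appears as an input: the classification depends on ownership, bug class, and bounded entropy only.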
For candidate additions that are not ordinary repair code, use
docs/current/AEGIS_MINIMALITY_REFERENCE.md to check whether the new surface
needs to exist before editing.
If the repair or retirement boundary depends on deleting old paths,
retaining compat for a proven external dependency, or stopping on
persistent-state risk, compose anti-entropy-governance before editing. It
decides the path; it does not grant destructive authority.
Pre-Edit Complexity Check
After root cause and canonical owner are identified, check whether the fix adds complexity to the wrong or overloaded place:
Use using-aegis/references/complexity-governance.md for shared pressure
signals and the meaning of over-budget.
Pre-Edit Complexity Check:
- Target edit file:
- Existing pressure signal:
- Owner fit:
- Safer edit boundary:
- Decision: edit-in-place | extract helper | add owner file | split task | pause for plan update
If the safer boundary changes the implementation shape, pause and update the plan/spec.
If the likely repair would grow an already oversized maintained artifact and the slice cannot govern that growth immediately, do not present the repair as a completed fix boundary. Escalate with a plan update or a visible follow-up requirement.
Before claiming a root cause and entering Phase 4, check whether the Pre-Claim Gate applies. It applies whenever any Patch-Shape Triage signal is active (candidate fix is a guard, fallback, consumer/caller patch, artifact/cache patch, or sample-only naming — i.e. H1 / H3 / H8 / H10 / H11 / H13), or whenever the diagnosis crosses a component or system boundary, or a previous fix left a residual symptom.
When it applies, do not state a root cause or edit code until the five
mechanical checks below pass. See root-cause-claim-contract.md for the full
rationale, the six-topology table, and a worked example.
Required output before entering Phase 4 when the gate fires:
Pre-Claim Gate Pass:
Topology: single-root | single-root-multi-symptom | chain | independent-compound | conjunctive-cluster | disjunctive-or
CausalClosure: closed | open-edge: <edge>
Falsifier: <if not-X then F; F checked: yes/no>
SelfRefutation: <strongest objection> -> <why it does not hold>
LayerCeiling: <L?> -> <why L?+1 unreachable>
Verdict: pass | fail-<which-gate>
This gate is advisory method-pack discipline. It is not a GateDecision,
PolicySnapshot, evidence sufficiency authority, or completion authority. It
turns a self-judged stop ("I think this is deep enough") into a checkable,
falsifiable claim ("here is the evidence chain, the falsifier I checked, the
objection I survived, and the ceiling I reached"). The quick bug lane is
exempt when no Patch-Shape signal fires and the bug is single-owner at the
canonical owner.
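The applicability rule and the pass/fail verdict can be sketched as two small functions. The signal codes come from the text above; the function names, the argument shapes, and the use of the output field names as check names are assumptions for illustration, since the five checks themselves are defined in root-cause-claim-contract.md.

```python
# Patch-Shape Triage signals that activate the gate, as listed in the text.
PATCH_SHAPE_SIGNALS = {"H1", "H3", "H8", "H10", "H11", "H13"}

# ASSUMPTION: the five mechanical checks are named after the output fields.
CHECKS = ("Topology", "CausalClosure", "Falsifier", "SelfRefutation", "LayerCeiling")


def pre_claim_gate_applies(active_signals, crosses_boundary, residual_symptom):
    """True when any of the three trigger conditions holds."""
    has_patch_shape = bool(PATCH_SHAPE_SIGNALS & set(active_signals))
    return has_patch_shape or crosses_boundary or residual_symptom


def gate_verdict(results):
    """Return "pass", or "fail-<check>" for the first check that did not pass.

    `results` maps a check name to True/False; a missing check counts as failed.
    """
    for name in CHECKS:
        if not results.get(name, False):
            return f"fail-{name}"
    return "pass"


print(pre_claim_gate_applies({"H3"}, crosses_boundary=False, residual_symptom=False))  # True
print(gate_verdict({"Topology": True, "CausalClosure": True}))  # fail-Falsifier
```

Treating a missing check as a failure keeps the gate mechanical: a claim passes only when every check has been answered explicitly.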
Fix the root cause, not the symptom:
Create Failing Test Case
Implement Single Fix
Verify Fix
If Fix Doesn't Work
4bis. Post-Fix Differential Diagnosis
After applying a fix, if ANY symptom persists:
STOP. Do NOT attempt another fix without diagnosis.
| Residual pattern | Diagnosis | Action |
|---|---|---|
| Same reproduction conditions as fixed symptom | Fix is incomplete | Continue upward drilling from same source |
| Different reproduction conditions, chains converge to same source | Fix was at wrong depth | Drill upward again from the shared source |
| Different reproduction conditions, chains diverge | Compound root cause (≥2 independent roots) | Each root needs its own fix |
| Same symptom, reduced but not eliminated | Fix was a downstream patch | Drill upward again from source |
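For readers who want the table in executable form, it reduces to a lookup from residual pattern to diagnosis and action. This is only a sketch: the `Residual` enum and `next_step` helper are hypothetical names, and deciding which pattern a residual symptom matches still requires the investigation described above.

```python
from enum import Enum


class Residual(Enum):
    """Residual patterns from the differential diagnosis table."""
    SAME_CONDITIONS = "same reproduction conditions as fixed symptom"
    DIFFERENT_CONVERGING = "different conditions, chains converge to same source"
    DIFFERENT_DIVERGING = "different conditions, chains diverge"
    REDUCED_NOT_ELIMINATED = "same symptom, reduced but not eliminated"


# pattern -> (diagnosis, action), transcribed from the table
DIFFERENTIAL = {
    Residual.SAME_CONDITIONS: (
        "fix is incomplete", "continue upward drilling from same source"),
    Residual.DIFFERENT_CONVERGING: (
        "fix was at wrong depth", "drill upward again from the shared source"),
    Residual.DIFFERENT_DIVERGING: (
        "compound root cause (2+ independent roots)", "each root needs its own fix"),
    Residual.REDUCED_NOT_ELIMINATED: (
        "fix was a downstream patch", "drill upward again from source"),
}


def next_step(pattern: Residual) -> str:
    diagnosis, action = DIFFERENTIAL[pattern]
    return f"STOP. Diagnosis: {diagnosis}. Action: {action}."


print(next_step(Residual.REDUCED_NOT_ELIMINATED))
```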
Compound root cause forms (legacy shorthand):
Causal Topology Gate (full form, used by the Pre-Claim Gate): the three
legacy forms above are a shorthand. Before claiming any root — single or
compound — classify the topology explicitly. The default is unknown; you
must actively exclude the multi-root topologies before collapsing to a
single-root claim. See root-cause-claim-contract.md for the full table,
member necessity/sufficiency tests, and the anti-disguise check.
| Topology | Structure | Stop condition | Repair shape |
|---|---|---|---|
| single-root | A → symptom | Layer Ceiling Proof at A | fix A |
| single-root-multi-symptom | A → B, C, D | Layer Ceiling Proof at A | fix A; symptoms self-resolve |
| chain | A → B → C → symptom | Layer Ceiling Proof at A | drill to A, fix A |
| independent-compound | A → symptom, Y → symptom, A ⊥ Y | each root passes Gate 1/2/5; no shared upstream | fix A and Y; missing one leaves symptom |
| conjunctive-cluster | A ∧ B ∧ C → symptom (each necessary, none sufficient) | enumerate members, necessity test each, sufficiency test the set, anti-disguise check | fix all members; missing one leaves symptom |
| disjunctive-or | A ∨ B → symptom (any one suffices) | enumerate all disjuncts | fix one to stop symptom; enumerate rest for defense-in-depth |
Member proof (cluster / compound): each claimed member must pass a necessity test ("if this member alone were removed, would the symptom still occur?" — if yes, it is not a member). The set must pass a sufficiency test (together the members explain every observed manifestation). Necessity tests here are conceptual proofs, not empirical runs — a method-pack ceiling; state this honestly when the cluster has many members.
Anti-disguise check (most often skipped): before accepting
conjunctive-cluster, ask whether members X and Y share a deeper common
cause Z, such that X and Y are merely two manifestations of Z. If yes, the
topology collapses to single-root-multi-symptom or chain rooted at Z —
drill to Z. The reverse check protects independent-compound: if two
divergent chains share upstream Z, they are not independent and Z is the
root.
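The member proof and the anti-disguise check can be sketched over a toy causal model. Everything here is an illustrative assumption: the function names, the representation of the symptom as a boolean function of member states, and the cause graph as a mapping from each node to its direct causes. As noted above, these are conceptual proofs on a model, not empirical runs.

```python
from itertools import combinations


def ancestors(node, parents):
    """All upstream causes of `node`; `parents` maps node -> direct causes."""
    seen, stack = set(), list(parents.get(node, ()))
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(parents.get(cur, ()))
    return seen


def is_necessary(member, members, symptom):
    """Necessity: with only `member` removed, the symptom no longer occurs."""
    state = {m: m != member for m in members}
    return not symptom(state)


def is_sufficient(members, symptom):
    """Sufficiency (in this model): all members present reproduces the symptom."""
    return symptom({m: True for m in members})


def shared_upstream(members, parents):
    """Anti-disguise: deeper causes shared by any pair of claimed members."""
    shared = set()
    for x, y in combinations(members, 2):
        shared |= ancestors(x, parents) & ancestors(y, parents)
    return shared


# Claimed conjunctive cluster: X AND Y -> symptom
members = ["X", "Y"]
symptom = lambda state: state["X"] and state["Y"]
# ...but both X and Y are caused by Z.
parents = {"X": ["Z"], "Y": ["Z"]}

print(all(is_necessary(m, members, symptom) for m in members))  # True
print(is_sufficient(members, symptom))                          # True
print(shared_upstream(members, parents))                        # {'Z'}
```

In this example the cluster passes both member tests and still fails the anti-disguise check: the non-empty result means X and Y are manifestations of Z, so the topology collapses and the drill continues to Z.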
If 3+ Fixes Failed: Question Architecture
Pattern indicating architectural problem:
STOP and question fundamentals. Discuss with your human partner before attempting more fixes. This is NOT a failed hypothesis — this is a wrong architecture.
Deliver Dual-Track Closure
For bug fixes, refactors, contract changes, or governance cleanup, always produce:
Repair track — root cause, canonical owner, smallest necessary change, compatibility boundary, verification method.
Retirement track — old owner / fallback / patch, whether it is still active on the main path, the only reason to keep it (if any), trigger for deletion, verification needed before removal.
Never add a new owner, fallback, prompt branch, or adapter path without stating what happens to the old one.
Before you claim debugging is complete:
Workspace record for non-trivial debugging — if this is medium+ complexity
or it writes docs/aegis/ records, initialize/check through configured
Aegis workspace support when available:
python <aegis-workspace-helper> init --root <target-project-root>
python <aegis-workspace-helper> new-work --root <target-project-root> ...
python <aegis-workspace-helper> add-evidence --root <target-project-root> --work <YYYY-MM-DD-slug> ...
python <aegis-workspace-helper> check --root <target-project-root>
Fast bug fix or quick bug fix pressure does not skip this: if Ripple Signal Triage fires, do the triage before editing and expand verification to the canonical owner plus affected downstream path.
These records are method-pack evidence trails only. They do not grant authoritative completion.
Stop-when review — re-read the diagnostic layer where you stopped. Did you reach "no deeper why remains" or a T-class terminal boundary? If the chain ended at L1-L2 and the evidence is conclusive, that is a valid endpoint. If there are still unexplained "why" questions, continue upward drilling before claiming done.
Expose a Layer Stop Card when the stop point affects the fix boundary,
contract owner, spec/product decision, or user correction path. Keep
simple fast-path explanations cheap; do not emit the card for ordinary
factual Q&A about the skill itself.
Hard signal check: apply these countable facts, not judgments.
Must continue upward drilling (H-class: ANY hit = NOT done):
- if / switch / catch / try)
- git log --grep shows this symptom was "fixed" before → Read that commit's diff. Understand why it failed. Do not repeat the same patch pattern.
- conjunctive-cluster or independent-compound but the member set is not enumerated, or a member was not necessity-tested
- conjunctive-cluster or independent-compound without running the anti-disguise check (a shared upstream Z may collapse the cluster/compound to a single root)
Terminal unactionable (T-class: any hit = stop drilling, switch to mitigation):
Depth sufficient (D-class: ALL must pass before claiming done):
- sufficient repair, or the local patch is explicitly bounded with retention reason and retirement trigger
- conjunctive-cluster or independent-compound, every member is enumerated and necessity-tested, and the set is sufficiency-tested
- conjunctive-cluster or independent-compound classification (a shared upstream Z was sought)
Reflection: re-run Goal / DeeperCause / Evidence / Risk/Unknown / Decision
Confirm the fix addressed the source, not just the sample
Retirement surface — did it shrink, stay, or grow?
Confidence:
A = direct evidence and regression coverage support the root-cause conclusion
B = strong evidence, limited coverage or some bounded unknowns remain
C = partial evidence only; do not present as fully resolved
If confidence is not at least B, do not speak as if the issue is fully closed.
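The hard-signal classes and the confidence grades combine into a single closure decision, which the sketch below makes explicit. The function name, its arguments, and the returned strings are hypothetical. One ordering choice is an assumption not stated in the text: a T-class hit is checked before H-class hits, on the reading that a terminal boundary ends drilling regardless of other signals.

```python
def closure_status(h_hits, t_hits, d_checks, confidence):
    """Decide what may be claimed at the end of a debugging session.

    h_hits / t_hits: lists of hard signals that fired (countable facts).
    d_checks: mapping of depth check name -> passed (True/False).
    confidence: "A", "B", or "C".
    """
    # ASSUMPTION: T-class takes precedence over H-class.
    if t_hits:
        return "stop drilling; switch to mitigation"
    if h_hits:  # ANY hit means not done
        return "not done; continue upward drilling"
    if not all(d_checks.values()):  # ALL must pass
        return "not done; depth checks incomplete"
    if confidence not in ("A", "B"):
        return "do not present as fully closed"
    return "done"


print(closure_status(
    h_hits=[],
    t_hits=[],
    d_checks={"repair verdict": True, "members tested": True},
    confidence="C",
))  # do not present as fully closed
```

The inputs are lists and booleans on purpose: the check is meant to count facts, so nothing in the function weighs or scores a signal.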
If you catch yourself thinking:
ALL of these mean: STOP. Return to Phase 1.
If 3+ fixes failed: Question the architecture (see Phase 4 Step 5).
If symptoms persist after fix: Run differential diagnosis (see Phase 4 Step 4bis).
If you hear "Is that not happening?", "Will it show us...?", "Stop guessing", "Ultrathink this" → STOP. Return to Phase 1.
If investigation reveals the issue is truly environmental, timing-dependent, or external: document what you investigated, implement appropriate handling (retry, timeout, error message), add monitoring.
See root-cause-tracing.md, defense-in-depth.md, condition-based-waiting.md, feedback-loop-construction.md, and root-cause-claim-contract.md in this directory for deeper guidance on specific diagnostic scenarios.
npx claudepluginhub ganyuanran/aegis --plugin aegis