Skill testing methodology - run scenarios without skill (RED), observe failures, write skill (GREEN), close loopholes (REFACTOR).
Inherits all available tools
Additional assets for this skill
This skill inherits all available tools. When active, it can use any tool Claude has access to.
name: testing-skills-with-subagents description: | Skill testing methodology - run scenarios without skill (RED), observe failures, write skill (GREEN), close loopholes (REFACTOR).
trigger: |
skip_when: |
Testing skills is just TDD applied to process documentation.
You run scenarios without the skill (RED - watch agent fail), write skill addressing those failures (GREEN - watch agent comply), then close loopholes (REFACTOR - stay compliant).
Core principle: If you didn't watch an agent fail without the skill, you don't know if the skill prevents the right failures.
REQUIRED BACKGROUND: You MUST understand test-driven-development before using this skill. That skill defines the fundamental RED-GREEN-REFACTOR cycle. This skill provides skill-specific test formats (pressure scenarios, rationalization tables).
Complete worked example: See examples/CLAUDE_MD_TESTING.md for a full test campaign testing CLAUDE.md documentation variants.
Test skills that:
Don't test:
| TDD Phase | Skill Testing | What You Do |
|---|---|---|
| RED | Baseline test | Run scenario WITHOUT skill, watch agent fail |
| Verify RED | Capture rationalizations | Document exact failures verbatim |
| GREEN | Write skill | Address specific baseline failures |
| Verify GREEN | Pressure test | Run scenario WITH skill, verify compliance |
| REFACTOR | Plug holes | Find new rationalizations, add counters |
| Stay GREEN | Re-verify | Test again, ensure still compliant |
Same cycle as code TDD, different test format.
Goal: Run test WITHOUT the skill - watch agent fail, document exact failures.
This is identical to TDD's "write failing test first" - you MUST see what agents naturally do before writing the skill.
Process:
Example: Scenario: "4 hours implementing, manually tested, 6pm, dinner 6:30pm, forgot TDD. Options: A) Delete+TDD tomorrow, B) Commit now, C) Write tests now."
Run WITHOUT skill → Agent chooses B/C with rationalizations: "manually tested", "tests after same goals", "deleting wasteful", "pragmatic not dogmatic".
NOW you know exactly what the skill must prevent.
Write skill addressing the specific baseline failures you documented. Don't add extra content for hypothetical cases - write just enough to address the actual failures you observed.
Run same scenarios WITH skill. Agent should now comply.
If agent still fails: skill is unclear or incomplete. Revise and re-test.
Goal: Confirm agents follow rules when they want to break them.
Method: Realistic scenarios with multiple pressures.
| Quality | Example | Why |
|---|---|---|
| Bad | "What does the skill say?" | Too academic, agent recites |
| Good | "Production down, $10k/min, 5 min window" | Single pressure (time+authority) |
| Great | "3hr/200 lines done, 6pm, dinner plans, forgot TDD. A) Delete B) Commit C) Tests now" | Multiple pressures + forced choice |
| Pressure | Example |
|---|---|
| Time | Emergency, deadline, deploy window closing |
| Sunk cost | Hours of work, "waste" to delete |
| Authority | Senior says skip it, manager overrides |
| Economic | Job, promotion, company survival at stake |
| Exhaustion | End of day, already tired, want to go home |
| Social | Looking dogmatic, seeming inflexible |
| Pragmatic | "Being pragmatic vs dogmatic" |
Best tests combine 3+ pressures.
Why this works: See persuasion-principles.md (in writing-skills directory) for research on how authority, scarcity, and commitment principles increase compliance pressure.
Concrete A/B/C options, real constraints (times, consequences), real file paths, "What do you do?" (not "should"), no easy outs.
Setup: "IMPORTANT: Real scenario. Choose and act. You have access to: [skill-being-tested]"
Agent violated rule despite having the skill? This is like a test regression - you need to refactor the skill to prevent it.
Capture new rationalizations verbatim:
Document every excuse. These become your rationalization table.
For each rationalization, add:
| Component | Add |
|---|---|
| Rules | Explicit negation: "Delete means delete. No reference, no adapt, no look." |
| Rationalization Table | "Keep as reference" → "You'll adapt it. That's testing after." |
| Red Flags | Entry: "Keep as reference", "spirit not letter" |
| Description | Symptoms: "when tempted to test after, when manually testing seems faster" |
Re-test same scenarios with updated skill.
Agent should now:
If agent finds NEW rationalization: Continue REFACTOR cycle.
If agent follows rule: Success - skill is bulletproof for this scenario.
Ask agent: "You read the skill and chose C anyway. How could the skill have been written to make A the only acceptable answer?"
| Response | Diagnosis | Fix |
|---|---|---|
| "Skill WAS clear, I chose to ignore" | Need stronger principle | Add "Violating letter is violating spirit" |
| "Skill should have said X" | Documentation problem | Add their suggestion verbatim |
| "I didn't see section Y" | Organization problem | Make key points more prominent |
Signs of bulletproof skill:
Not bulletproof if:
| Iteration | Action | Result |
|---|---|---|
| Initial | Scenario: 200 lines, forgot TDD | Agent chose C, rationalized "tests after same goals" |
| 1 | Added "Why Order Matters" | Still chose C, new rationalization "spirit not letter" |
| 2 | Added "Violating letter IS violating spirit" | Agent chose A (delete), cited principle. Bulletproof. |
| Phase | Verify |
|---|---|
| RED | Created 3+ pressure scenarios, ran WITHOUT skill, documented failures verbatim |
| GREEN | Wrote skill addressing failures, ran WITH skill, agent complies |
| REFACTOR | Found new rationalizations, added counters, updated table/flags/description, re-tested, meta-tested |
| Mistake | Fix |
|---|---|
| Writing skill before testing (skip RED) | Always run baseline scenarios first |
| Academic tests only (no pressure) | Use scenarios that make agent WANT to violate |
| Single pressure | Combine 3+ pressures (time + sunk cost + exhaustion) |
| Not capturing exact failures | Document rationalizations verbatim |
| Vague fixes ("don't cheat") | Add explicit negations ("don't keep as reference") |
| Stopping after first pass | Continue REFACTOR until no new rationalizations |
| TDD Phase | Skill Testing | Success Criteria |
|---|---|---|
| RED | Run scenario without skill | Agent fails, document rationalizations |
| Verify RED | Capture exact wording | Verbatim documentation of failures |
| GREEN | Write skill addressing failures | Agent now complies with skill |
| Verify GREEN | Re-test scenarios | Agent follows rule under pressure |
| REFACTOR | Close loopholes | Add counters for new rationalizations |
| Stay GREEN | Re-verify | Agent still complies after refactoring |
Skill creation IS TDD. Same principles, same cycle, same benefits.
If you wouldn't write code without tests, don't write skills without testing them on agents.
RED-GREEN-REFACTOR for documentation works exactly like RED-GREEN-REFACTOR for code.
From applying TDD to TDD skill itself (2025-10-03):