Use after pressure testing passes to validate that skills produce expert-level output quality, not just process compliance. Uses expert-agent comparison to detect missing tacit knowledge, mechanical application, and quality gaps, and generates actionable feedback for skill improvement.
Inherits all available tools
This skill inherits all available tools. When active, it can use any tool Claude has access to.
Testing-skills-with-subagents validates process compliance under pressure. This skill validates output quality and expertise transfer.
Agents can follow a skill perfectly but still produce mediocre work if the skill:
- Lacks the tacit knowledge experts rely on
- Encourages mechanical application instead of context adaptation
- Never shows what expert-level quality looks like (trade-offs, alternatives, reasoning depth)
This skill uses expert-agent comparison testing to detect these gaps and generate actionable feedback for skill improvement.
Core principle: If skill-guided output quality doesn't match expert-level output quality, the skill is missing expertise.
Use this skill when: a skill has passed pressure testing and you need to confirm it transfers expert-level judgment, not just a repeatable process.
Don't use for: validating process compliance under pressure; that is what testing-skills-with-subagents covers.
Common failure pattern: an agent follows every step of a skill yet produces work an expert would consider mediocre.
Root causes: missing tacit knowledge, mechanical application without context adaptation, and absent quality criteria.
This skill detects these gaps and shows you exactly what to add to your skill.
Run identical scenarios through two agents, then compare the outputs with a third:
Agent A (Expert Baseline): No skill, pure domain expertise
Agent B (Skill-Guided): With skill being tested
Agent C (Analyzer): Compares outputs, identifies gaps
If Agent B output matches Agent A: Skill successfully transfers expertise ✓
If Agent B output is worse than Agent A: Skill has gaps ✗
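A minimal sketch of how this harness might be wired up, assuming a hypothetical `run_agent(system_prompt, scenario)` helper that dispatches a subagent and returns its text output; all names here are illustrative, not part of the skill itself:

```python
from dataclasses import dataclass

@dataclass
class ComparisonRun:
    scenario: str
    expert_output: str   # Agent A: no skill, pure domain expertise
    skill_output: str    # Agent B: guided by the skill under test

def run_comparison(scenario: str, skill_text: str, run_agent) -> ComparisonRun:
    # Agent A: expert baseline, never sees the skill being tested
    expert = run_agent(
        "You are a domain expert. Apply your full expertise and reason "
        "explicitly before giving a recommendation.",
        scenario,
    )
    # Agent B: identical scenario, but guided by the skill under test
    guided = run_agent(
        f"You have access to this skill:\n\n{skill_text}\n\n"
        "Apply the skill to address the scenario.",
        scenario,
    )
    # Agent C (the analyzer) later receives both outputs for gap analysis.
    return ComparisonRun(scenario, expert, guided)
```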
Create 3-5 scenarios that test context-adaptation:
## Scenario [N]: [Brief Title]
### Context
[Detailed scenario with all relevant context]
- Domain: [what domain/area]
- Scale: [size/scope indicators]
- Constraints: [time, resources, team]
- Special factors: [anything unusual]
### Task
[Specific task/problem to solve]
### Success Criteria
[What good output looks like - for later comparison]
Design scenarios to vary along key dimensions:
Scale variation: Same problem, different scale
Constraint variation: Same problem, different constraints
Context variation: Same problem, different situation
Edge cases: Valid but unusual situations
Goal: if the agent applies the same solution across all scenarios, that reveals mechanical thinking.
Example for a TDD skill: scenarios might cover a greenfield feature, a legacy module without tests, a throwaway prototype, and a production bug under time pressure (see the sketch below).
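As a rough illustration, such a scenario set can be kept as plain data, with the variation dimension noted on each entry so the analyzer can check whether the approach actually changed with context; the contents below are illustrative:

```python
# Each scenario varies one key dimension so that one-size-fits-all answers
# become visible when the outputs are compared.
tdd_scenarios = [
    {"title": "Greenfield feature", "dimension": "baseline",
     "context": "New service, no existing code, small team, normal deadlines."},
    {"title": "Legacy module", "dimension": "context variation",
     "context": "Working payment code with no tests; refactor needed next sprint."},
    {"title": "Throwaway prototype", "dimension": "constraint variation",
     "context": "One-week spike to explore a design; code will likely be discarded."},
    {"title": "Production bug", "dimension": "constraint variation",
     "context": "Customer-impacting defect; fix needed within hours."},
    {"title": "Large monorepo", "dimension": "scale variation",
     "context": "1000-engineer codebase with strict CI gates and shared ownership."},
]
```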
Create expert baseline for each scenario:
You are an expert in [domain with specific expertise areas].
Apply your full expertise to this scenario. Show expert-level thinking:
- What factors would you consider first?
- What trade-offs exist?
- How would different contexts change your approach?
- What alternatives exist and why choose one over another?
Think through your reasoning explicitly before providing recommendation.
Scenario:
[Full scenario context]
Provide your expert analysis and recommendation with complete reasoning.
For each scenario, document the expert's key considerations, the trade-offs articulated, the alternatives explored, and the final recommendation with reasoning.
Store this as the quality target for skill-guided output.
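One possible way to persist these baselines is a small record per scenario; the field names below are illustrative, not prescribed by the skill:

```python
import json
from pathlib import Path

# One record per scenario: the expert's full response plus the notes that later
# act as the quality target for the skill-guided agent. Contents are illustrative.
baseline = {
    "scenario": "Legacy payment module without tests",
    "expert_output": "<Agent A's full response>",
    "key_considerations": ["design already exists", "regression risk during refactor"],
    "trade_offs": ["coverage breadth vs. time to safe change"],
    "alternatives": ["full rewrite test-first (rejected: discards working code)"],
    "recommendation": "characterization tests around the area about to change",
}

Path("baselines").mkdir(exist_ok=True)
(Path("baselines") / "legacy-payment-module.json").write_text(json.dumps(baseline, indent=2))
```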
Test agent with skill being validated:
You have access to: [skill-being-tested]
Scenario:
[Identical scenario context as Agent A]
Apply the skill to address this scenario.
For each scenario, document the agent's approach, its reasoning (or lack of it), and its final output for later comparison against the expert baseline.
Compare expert vs skill-guided outputs systematically:
You are evaluating skill quality by comparing expert baseline vs skill-guided outputs.
EXPERT OUTPUT (Agent A - what expert-level reasoning looks like):
[Agent A's full response]
SKILL-GUIDED OUTPUT (Agent B - what the skill produced):
[Agent B's full response]
Analyze gaps and generate actionable feedback:
## A. Missing Tacit Knowledge
For each knowledge gap, provide:
### Gap [N]: [Name]
**What expert considered:**
[Quote expert's reasoning verbatim]
**What skill-guided agent did:**
[Quote agent's behavior verbatim]
**Why this matters:**
[Explain impact of missing knowledge]
**Suggested skill improvement:**
[Specific, concrete addition to skill]
- Add principle: [exact wording]
- Add check: [specific prompt]
- Add example: [illustration]
---
## B. Mechanical Application Patterns
For each pattern detected across multiple scenarios:
### Pattern [N]: [Description]
**Scenarios affected:** [list]
**Agent behavior across scenarios:**
[Show same approach used inappropriately]
**Expert behavior across scenarios:**
[Show how expert adapted to context]
**Context factors agent ignored:**
[List factors that should have influenced approach]
**Suggested skill improvement:**
[Specific guidance for context-adaptation]
- Add decision matrix: [when to do what]
- Add context assessment: [factors to check]
- Add examples: [same principle, different contexts]
---
## C. Quality Delta Analysis
Score both outputs on each dimension (0-5 scale):
| Dimension | Expert | Skill | Gap | Status |
|-----------|--------|-------|-----|--------|
| Trade-off articulation | X | Y | Z | PASS/FAIL |
| Context incorporation | X | Y | Z | PASS/FAIL |
| Alternative exploration | X | Y | Z | PASS/FAIL |
| Reasoning depth | X | Y | Z | PASS/FAIL |
| Appropriateness | X | Y | Z | PASS/FAIL |
For each dimension with gap >1, provide:
### Dimension: [Name]
**Expert example:**
[Quote showing expert-level quality]
**Skill-guided example:**
[Quote showing skill-guided quality]
**Quality gap explanation:**
[What makes expert output better, specifically]
**Suggested skill improvement:**
[How to prompt for this quality dimension]
---
## Summary Report
### Overall Assessment
- Scenarios passing quality threshold (gap <1.0): X/Y
- Overall status: PASS / FAIL
### Top 3 Priority Improvements
1. [Highest impact improvement with rationale]
2. [Second priority with rationale]
3. [Third priority with rationale]
### Re-test Criteria
After improvements, success means:
- Quality gap <1.0 on all scenarios
- Context adaptation evident (solutions vary appropriately)
- Trade-off reasoning present (score >3.5 on all scenarios)
- No mechanical application patterns detected
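A minimal sketch of the pass/fail arithmetic, assuming the analyzer's 0-5 scores have been pulled out as plain numbers and treating the scenario gap as the mean per-dimension gap (one reasonable interpretation of the criteria above):

```python
from statistics import mean

# 0-5 scores for one scenario, as read off the analyzer's table (values illustrative).
expert = {"trade_offs": 4.5, "context": 4.0, "alternatives": 4.0, "depth": 4.5, "appropriateness": 5.0}
skill  = {"trade_offs": 3.0, "context": 2.5, "alternatives": 3.5, "depth": 3.0, "appropriateness": 4.0}

gaps = {dim: expert[dim] - skill[dim] for dim in expert}
scenario_gap = mean(gaps.values())          # aggregate as the mean per-dimension gap
needs_feedback = [dim for dim, gap in gaps.items() if gap > 1]

# Mirror the re-test criteria: overall gap below 1.0 and trade-off reasoning above 3.5
passes = scenario_gap < 1.0 and skill["trade_offs"] > 3.5
print(f"Scenario gap {scenario_gap:.2f}: {'PASS' if passes else 'FAIL'}")
print("Dimensions needing targeted feedback:", needs_feedback)
```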
Use the generated feedback to improve the skill systematically. For each feedback item from the gap analysis:
1. Determine what needs to be added
2. Decide where in the skill it belongs
3. Write the actual skill improvement, using the templates below
For missing tacit knowledge:
## [Principle Name]
Before [deciding X], always assess [factor Y]:
- If [condition A], then [approach 1] because [reasoning]
- If [condition B], then [approach 2] because [reasoning]
**Why this matters:** [Explain consequences of ignoring this]
**Example - Expert check:**
"I'd first look at [X factor] to determine whether [Y approach] is appropriate here."
**Example - Missing check:**
"Proceeding with [Y approach] without checking [X factor] first."
For mechanical application:
## Context Adaptation: [Domain Area]
The same principles apply differently in different contexts:
| Context | Approach | Rationale |
|---------|----------|-----------|
| [Context A] | [Approach 1] | [Why appropriate here] |
| [Context B] | [Approach 2] | [Why appropriate here] |
| [Context C] | [Approach 3] | [Why appropriate here] |
**Assessment questions:**
- What's the [scale/maturity/constraint] of this situation?
- What [context factors] should influence my approach?
**Red flag:** Applying [approach X] regardless of [context factor Y]
For trade-off reasoning:
## Articulating Trade-offs
For decisions under constraints, explicitly state:
**Template:**
"I'm choosing [X] over [Y] because [context factors]. This trades [cost/downside] for [benefit]. I'm accepting [trade-off] because [reasoning]. If [condition changes], I'd reconsider."
**Good example:**
[Quote expert-level trade-off reasoning]
**Poor example:**
[Quote mechanical/arbitrary decision]
**Why trade-off articulation matters:**
[Explain value of explicit reasoning]
After adding improvements, re-run the same scenarios through a fresh skill-guided agent and compare against the expert baselines again.
Success criteria: the re-test criteria above (quality gap <1.0 on every scenario, visible context adaptation, explicit trade-off reasoning, no mechanical application patterns).
Continue the improvement loop until all scenarios pass; a sketch of that loop follows.
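A minimal sketch of that loop, assuming hypothetical `run_comparison` and `analyze_gaps` helpers that wrap the Agent A/B/C prompts above, and an `apply_feedback` step that adds the top-priority improvements to the skill text:

```python
MAX_ITERATIONS = 5  # guard against endlessly patching a skill that is not converging

def improvement_loop(skill_text, scenarios, run_comparison, analyze_gaps, apply_feedback):
    """Re-test after each round of improvements until every scenario passes."""
    for _ in range(MAX_ITERATIONS):
        reports = [analyze_gaps(run_comparison(scenario, skill_text)) for scenario in scenarios]
        # Illustrative report keys: "gap" (mean quality gap) and "mechanical" (pattern detected?)
        if all(r["gap"] < 1.0 and not r["mechanical"] for r in reports):
            return skill_text, reports                     # quality-validated
        skill_text = apply_feedback(skill_text, reports)   # add top-priority improvements
    raise RuntimeError("Quality threshold not reached; revisit scenarios or skill design")
```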
This skill extends the existing testing-skills-with-subagents workflow: pressure testing validates process compliance under pressure, and this skill adds a quality-validation pass on the resulting output.
Patterns frequently missed by skills:
Expert knows: When principles apply vs don't apply
Skill often lacks: Guidance on context recognition
Fix: Add "When This Applies" section with conditions

Expert knows: Solution complexity should match problem scale
Skill often lacks: Decision matrix for different scales
Fix: Add scaling framework with context-specific approaches

Expert knows: Every decision trades something for something
Skill often lacks: Prompts for explicit trade-off articulation
Fix: Add trade-off reasoning template and examples

Expert knows: Recommend what team can actually maintain
Skill often lacks: Prompts to assess operational capability
Fix: Add team capability checks before solution recommendations

Expert knows: "Generally X, but when Y, then Z"
Skill often lacks: Conditional guidance and exceptions
Fix: Add decision trees showing condition-dependent approaches
Indicators of template thinking:
Agent output: Same solution for 5-person startup and 1000-person enterprise
Fix: Add context assessment step + decision matrix

Agent output: States conclusion without reasoning
Fix: Require explicit trade-off articulation with template

Agent output: "Use X because it's good practice"
Fix: Add "Why This Matters" sections explaining principles

Agent output: Ideal solution regardless of stated constraints
Fix: Add constraint assessment + context-appropriate alternatives

Agent output: Enterprise architecture for MVP
Fix: Add proportionality principle + scaling examples
Agent A (Expert Baseline):
This is existing, working code without tests. Test-first doesn't apply
here because the design already exists. Instead:
1. Write characterization tests documenting current behavior
2. Use tests to enable safe refactoring (not drive new design)
3. Focus on areas about to change (not comprehensive coverage)
The principle shifts from "tests drive design" to "tests enable change"
in legacy contexts. I'm adapting TDD principles to a context where
design pre-exists.
Agent B (Skill-Guided - BEFORE improvement):
Following TDD principles: Delete the legacy code and rewrite it test-first
to ensure clean design.
Gap Analysis: the skill-guided agent applied the test-first rule mechanically, proposing to delete working code, while the expert adapted TDD principles to a context where the design already exists.
Skill Improvement:
## When TDD Applies (Context Recognition)
TDD test-first principle serves different purposes in different contexts:
| Context | Approach | Purpose |
|---------|----------|---------|
| New code | Test-first | Tests drive design |
| Legacy working code | Characterization tests | Document behavior |
| Refactoring | Tests around change area | Enable safe changes |
| Prototypes | Minimal tests | Design unstable |
| Production bugs | Fix-first sometimes OK | Customer impact weighs heavily |
**Before applying test-first, assess:**
- Does design already exist? (legacy code)
- Is design stable? (prototype vs production)
- What's the goal? (design vs documentation vs safety)
**Red flag:** "Delete working code to rewrite test-first" for legacy code
**Why:** Wastes working implementation to apply "test-first" where design exists
Agent B (Skill-Guided - AFTER improvement):
This is legacy code that already works. Test-first doesn't apply here
because design pre-exists. Instead, I'll write characterization tests
to document current behavior, then use tests to enable safe refactoring.
Focusing on areas we're about to change, not comprehensive coverage,
because the goal is "enable change safely" not "drive new design."
Result: Agent B now matches Agent A quality - context adaptation present ✓
Before claiming a skill is quality-validated, run the full expert-agent comparison across all scenarios.
A skill is quality-validated when skill-guided output matches the expert baseline: quality gap <1.0 on every scenario, context adaptation evident, trade-offs articulated, and no mechanical application patterns detected.
Quality validation complements pressure testing: pressure testing confirms the process is followed under pressure; quality validation confirms the output matches expert-level work.
This skill itself should be quality-validated:
Create scenarios for skill validation, then run the expert-agent comparison on this validation skill itself.
If this skill passes its own validation, it's bulletproof.