Skill

eval-runner

Runs evaluation scenarios to benchmark agent performance via reflexion loops, validates success criteria, records metrics, generates reports, and proposes new evals from logs.

testing

Popularity

Parent stars

Parent forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/mycelium:eval-runner

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Benchmark the agent's performance against defined scenarios. Adapted from n-trax eval system.

SKILL.md

89 lines · ~987 tokens

Stats

LanguagePython

Parent stars27

Parent forks3

MaintenanceExcellent

Last CommitMay 9, 2026

Actions

View Source View Plugin View on GitHub View README

Eval Runner

Benchmark the agent's performance against defined scenarios. Adapted from n-trax eval system.

Commands

`run <category/name>`

Read YAML from .claude/evals/scenarios/<category>/<name>.yml
Parse fields (name, category, task_prompt, success_criteria, budget)
Execute setup steps if defined
Record start time
Execute task via reflexion workflow (read corrections first)
Record end time and iteration count
Validate ALL success criteria
Write result JSON to .claude/evals/results/<timestamp>-<name>.json
Report summary

`run-all [category]`

Glob .claude/evals/scenarios/**/*.yml
Skip scenarios with status: retired
For each: run in isolation (git stash), record result, restore
Update .claude/evals/pass-history.json with each result
Aggregate and report

`run-split <optimization|holdout>`

Glob .claude/evals/scenarios/**/*.yml
Read each YAML, filter by split field matching the requested set
Skip scenarios with status: retired
For each matching scenario: run in isolation, record result, restore
Update .claude/evals/pass-history.json with each result
Aggregate and report (label output clearly as "Optimization Set" or "Holdout Set")

`report`

Read all results from .claude/evals/results/
Generate summary table:

| Category    | Pass Rate | Avg Iterations | Avg Time | Notes |
|-------------|-----------|----------------|----------|-------|
| discovery   | ...       | ...            | ...      |       |
| delivery    | ...       | ...            | ...      |       |
| integration | ...       | ...            | ...      |       |
| **Overall** | ...       | ...            | ...      |       |

List failure patterns and recommendations

`prune`

Read .claude/evals/pass-history.json
Flag evals where last_5 is all-pass (saturated) or all-fail (broken)
Flag evals with no runs in 30+ days (stale)
For saturated evals, suggest: retire or increase difficulty
For broken evals, suggest: fix criteria or retire
Present recommendations — do NOT auto-retire
On user confirmation: set status: retired in scenario YAML, update pass-history.json, log in .claude/harness/decision-log.md

`mine`

Analyze audit logs to propose new eval scenarios from observed failure patterns.

Read .claude/state/change-log.jsonl (last 100 entries)
Read .claude/state/diamond-state-audit.jsonl (all entries)
Group change-log entries by session_id, identify: a. Correction clusters: 3+ edits to same file in one session (agent struggled) b. Skill friction: edits to .claude/skills/*/SKILL.md during a session (instructions unclear) c. Missing test coverage: 5+ files changed with no test file edits
Count diamond bypass entries from audit log
Cross-reference with existing scenarios in pass-history.json to avoid duplicates
For each pattern, output a proposed eval scenario as YAML template
Tag proposals with source: trace-mining and originating session_id
Do NOT auto-create files — present for human review

See .claude/evals/schema.md §Trace Mining Heuristics for pattern-to-eval mappings.

Result Format

See .claude/evals/schema.md for YAML scenario and JSON result formats.

Pass History

After writing each result JSON (step 8 of run), also update .claude/evals/pass-history.json:

Increment runs and passes (if passed) for the eval
Append true/false to last_5 (trim to keep only last 5)
Update last_run timestamp

Creating Scenarios

Place YAML files in .claude/evals/scenarios/<category>/. Define task_prompt, success_criteria, and budget. Set split, status, and source fields per schema.

eval-runner

Popularity

Invocation

Context Preview

SKILL.md

eval-runner

Popularity

Invocation

Context Preview

SKILL.md

Eval Runner

Commands

`run <category/name>`

`run-all [category]`

`run-split <optimization|holdout>`

`report`

`prune`

`mine`

Result Format

Pass History

Creating Scenarios

Similar Skills

Eval Runner

Commands

`run <category/name>`

`run-all [category]`

`run-split <optimization|holdout>`

`report`

`prune`

`mine`

Result Format

Pass History

Creating Scenarios

Similar Skills