From mycelium
Runs evaluation scenarios to benchmark agent performance via reflexion loops, validates success criteria, records metrics, generates reports, and proposes new evals from logs.
How this skill is triggered — by the user, by Claude, or both
Slash command
/mycelium:eval-runnerThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Benchmark the agent's performance against defined scenarios. Adapted from n-trax eval system.
Benchmark the agent's performance against defined scenarios. Adapted from n-trax eval system.
run <category/name>.claude/evals/scenarios/<category>/<name>.yml.claude/evals/results/<timestamp>-<name>.jsonrun-all [category].claude/evals/scenarios/**/*.ymlstatus: retired.claude/evals/pass-history.json with each resultrun-split <optimization|holdout>.claude/evals/scenarios/**/*.ymlsplit field matching the requested setstatus: retired.claude/evals/pass-history.json with each resultreport.claude/evals/results/| Category | Pass Rate | Avg Iterations | Avg Time | Notes |
|-------------|-----------|----------------|----------|-------|
| discovery | ... | ... | ... | |
| delivery | ... | ... | ... | |
| integration | ... | ... | ... | |
| **Overall** | ... | ... | ... | |
prune.claude/evals/pass-history.jsonlast_5 is all-pass (saturated) or all-fail (broken)stale)status: retired in scenario YAML, update pass-history.json, log in .claude/harness/decision-log.mdmineAnalyze audit logs to propose new eval scenarios from observed failure patterns.
.claude/state/change-log.jsonl (last 100 entries).claude/state/diamond-state-audit.jsonl (all entries)session_id, identify:
a. Correction clusters: 3+ edits to same file in one session (agent struggled)
b. Skill friction: edits to .claude/skills/*/SKILL.md during a session (instructions unclear)
c. Missing test coverage: 5+ files changed with no test file editssource: trace-mining and originating session_idSee .claude/evals/schema.md §Trace Mining Heuristics for pattern-to-eval mappings.
See .claude/evals/schema.md for YAML scenario and JSON result formats.
After writing each result JSON (step 8 of run), also update .claude/evals/pass-history.json:
runs and passes (if passed) for the evaltrue/false to last_5 (trim to keep only last 5)last_run timestampPlace YAML files in .claude/evals/scenarios/<category>/. Define task_prompt, success_criteria, and budget. Set split, status, and source fields per schema.
npx claudepluginhub haabe/mycelium --plugin myceliumAnswers AI agent evaluation methodology questions using Microsoft's agent evaluation ecosystem, covering grader types, dataset design, criteria writing, non-determinism, tool-call evaluation, and multi-turn agents.
Evaluates and improves GenAI agent output quality using MLflow's native APIs for datasets, scorers, and tracing. Covers end-to-end evaluation workflow or individual components.
Runs evaluations on ADK agents: writing eval datasets, analyzing failures, comparing results, and optimizing agents using the Quality Flywheel methodology.