Run automated evaluations of Claude Code skills against test cases, scoring outputs with judges and tracking results in MLflow for experiment management. Includes a self-improvement loop that diagnoses failures, edits skill definitions, and re-runs evaluations to detect regressions.
Analyze a skill and generate eval.yaml for the agent eval harness. Deeply examines the skill's SKILL.md, sub-skills, scripts, and test cases to produce the full evaluation config — execution mode, dataset schema, output descriptions, judges, models, and thresholds. Use this skill whenever someone wants to set up evaluation, test a skill, add quality checks, benchmark a skill, or just created a new skill and needs eval infrastructure. Also triggered automatically by /eval-run when eval.yaml is missing. Even if the user just says "how do I know if my skill is working?" — this is the right starting point.
Evaluate the full harness configuration as a system. Scans all skills, commands, CLAUDE.md, and hooks for redundancy, overlap, type misclassification, and structural issues. Produces an informational report with restructuring suggestions. Use when the user wants to check their overall setup health, find redundant skills, detect overlapping triggers, or get restructuring recommendations before diving into individual skill evaluation. Triggers on "check my setup", "harness health", "are my skills redundant", "what should I merge", "setup overview", "configuration check".
Generate evaluation test cases for a skill. Creates realistic test inputs based on skill analysis, bootstraps a starter dataset, or expands an existing one to improve coverage. Use when setting up evaluation for the first time, when the user needs test cases, when coverage is too thin, or after /eval-analyze when no dataset exists yet. Triggers on "create test cases", "generate test data", "need test inputs", "make a dataset", "add more cases", "improve coverage". Also useful when /eval-run reports "no test cases found."
MLflow integration for evaluation — sync datasets, log run results, push/pull feedback between the harness and MLflow traces. Use when the user wants to log eval results to MLflow, sync test cases to MLflow datasets, connect judge scores to traces, pull MLflow annotations for eval-optimize, or view results in the MLflow UI. Triggers on "log to mlflow", "sync dataset", "push results", "mlflow integration", "view in mlflow".
Automated skill improvement loop. Runs eval, identifies judge failures, reads traces and rationale, edits the SKILL.md to fix issues, re-runs to verify, and checks for regressions. Use when the user wants to automatically improve a skill based on eval results, fix failing judges, make the skill better, auto-fix quality issues, improve scores, or iterate until all judges pass. Triggers on "optimize the skill", "make it pass", "auto-fix", "improve the scores", "why is it failing". Works best after /eval-run has produced results to learn from.
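The regression check at the end of the improvement loop can be pictured as a comparison of per-judge outcomes before and after an edit. The sketch below is only an illustration: the function name `find_regressions` and the judge-name-to-pass/fail mapping are assumptions made for this example, not the harness's actual data model.

```python
def find_regressions(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Return judges that passed before the edit but do not pass after it.

    Hypothetical helper: both arguments map a judge name to pass/fail.
    A judge missing from `after` is treated as not passing.
    """
    return sorted(
        name for name, passed in before.items()
        if passed and not after.get(name, False)
    )


def find_fixes(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Return judges that failed before the edit and pass after it."""
    return sorted(
        name for name, passed in after.items()
        if passed and not before.get(name, False)
    )


# Illustrative judge names, not taken from the harness.
before = {"format": True, "accuracy": False, "tone": True}
after = {"format": False, "accuracy": True, "tone": True}
print(find_regressions(before, after))
print(find_fixes(before, after))
```

An edit that fixes one judge while breaking another shows up in both lists, which is the situation a re-run is meant to catch.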
Generic evaluation framework for agents and skills. Analyze, run, score, and improve skills automatically across different agent harnesses (Claude Code, OpenCode, Agent SDK).
                                             ┌──────────────────┐
        ┌──────────────setup────────────────▶│  MLflow Server   │◀────────────┐
        │                                    │ (local / remote) │             │
        │                                    └──┬───────────────┘         sync, log
        │                                    datasets                     feedback
        │                                       │                             │
┌───────┴──────┐  ┌───────────────┐  ┌──────────▼───┐  ┌──────────────┐  ┌────┴───────────┐
│  eval-setup  │─▶│ eval-analyze  │─▶│ eval-dataset │─▶│   eval-run   │─▶│  eval-mlflow   │
│              │  │               │  │              │  │              │  │                │
│ dependencies │  │ analyze skill │  │ generate     │  │ execute eval │  │ sync dataset   │
│ MLflow conf  │  │ gen eval.yaml │  │ test cases   │  │ collect      │  │ log results    │
│ directories  │  │ suggest judges│  │ fill gaps    │  │ score        │  │ traces         │
└──────────────┘  └───────────────┘  └──────────────┘  └──▲──┬─▲──┬───┘  └────────────────┘
                                                          │  │ │  │
                                            ┌─────────────┘  │ │  └────────────┐
                                            │         ┌──────▼─┴─────┐         │
                                            │         │ eval-review  │         │
                                            │         │              │         │
                                            │         │ human review │         │
                                            │         │ feedback     │         │
                                            │         └──────────────┘         │
                                            │                                  │
                                            │        ┌───────────────┐         │
                                            └────────│ eval-optimize │◀────────┘
                                                     │               │
                                                     │ fix skill     │
                                                     │ re-run        │
                                                     └───────────────┘
Install from the skills registry:
claude plugin install agent-eval-harness@opendatahub-skills
Or clone and load as a local plugin:
git clone https://github.com/opendatahub-io/agent-eval-harness
pip install -e ./agent-eval-harness
claude --plugin-dir ./agent-eval-harness
This makes all eval skills available: /eval-setup, /eval-analyze, /eval-dataset, /eval-run, /eval-review, /eval-mlflow, /eval-optimize, and /eval-check.
/eval-setup
This checks dependencies, configures MLflow, verifies API keys, and creates directories.
/eval-analyze --skill my-skill
This examines the skill's SKILL.md, discovers test cases, and generates eval.yaml with:
schema descriptions of your dataset and outputs
/eval-dataset
Creates 5 starter test cases based on the skill analysis. Skip this if you already have cases.
/eval-run --model opus
This prepares a workspace, runs the skill (headless or interactive), collects artifacts, scores with judges, and reports results.
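As a rough picture of the final "score and report" step, the sketch below averages each judge's scores across cases and compares the mean with a threshold. The names `summarize`, `scores`, and `thresholds`, the numeric score scale, and the pass rule (mean at or above the threshold) are all assumptions made for illustration; the harness's real report format and scoring rules may differ.

```python
from statistics import mean


def summarize(scores: dict[str, list[float]],
              thresholds: dict[str, float]) -> dict[str, dict]:
    """Aggregate per-case judge scores into a pass/fail report.

    Hypothetical helper: `scores` maps a judge name to one score per case,
    `thresholds` maps a judge name to the minimum acceptable mean.
    Judges without a configured threshold default to 0.0 (always pass).
    """
    report = {}
    for judge, values in scores.items():
        avg = mean(values)
        report[judge] = {
            "mean": avg,
            "cases": len(values),
            "passed": avg >= thresholds.get(judge, 0.0),
        }
    return report


# Illustrative judges and scores, not real results.
report = summarize(
    {"accuracy": [1.0, 0.5], "format": [0.25, 0.25]},
    {"accuracy": 0.7, "format": 0.5},
)
for judge, row in report.items():
    print(judge, row)
```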
The harness uses natural language to describe evaluation datasets and skill inputs and outputs, and spawns LLM sub-agents to interpret those descriptions.
name: my-skill-eval
description: Evaluate the main skill pipeline
skill: my-skill-name

# Execution: how the skill processes test cases (runner-agnostic)
execution:
  mode: case                  # case (default) or batch
  arguments: "{prompt}"       # resolved per case from input.yaml fields
  # timeout: 3600             # Wall-clock timeout in seconds per invocation
  # max_budget_usd: 5.0       # Cost cap in USD per invocation
  # parallelism: 3            # Run up to N cases concurrently (case mode only)
  # env:                      # Inject env vars into workspace settings
  #   JIRA_SERVER: http://localhost:8080   # Literal value
  #   JIRA_TOKEN: $JIRA_TOKEN              # $VAR resolved from caller's env
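The comments on `arguments` and `env` describe two substitution rules: `{field}` placeholders filled per case from `input.yaml` fields, and `$VAR` values taken from the caller's environment. One way this could work is sketched below. The helper names `resolve_arguments` and `resolve_env` are hypothetical, and details such as returning an empty string for an unset variable are assumptions, not documented harness behavior.

```python
import os


def resolve_arguments(template: str, case_fields: dict) -> str:
    """Fill {field} placeholders from one test case's input fields.

    Assumption: placeholders use Python format syntax; a placeholder with
    no matching field raises KeyError rather than passing through silently.
    """
    return template.format_map(case_fields)


def resolve_env(env_spec: dict, caller_env=None) -> dict:
    """Resolve an `env:` mapping into concrete values.

    Values written as "$NAME" are looked up in the caller's environment;
    anything else is kept as a literal. Unset variables resolve to ""
    (an assumption made for this sketch).
    """
    source = os.environ if caller_env is None else caller_env
    resolved = {}
    for key, value in env_spec.items():
        if isinstance(value, str) and value.startswith("$"):
            resolved[key] = source.get(value[1:], "")
        else:
            resolved[key] = value
    return resolved


# Example with an explicit caller environment so the result is deterministic.
print(resolve_arguments("{prompt}", {"prompt": "Summarize the ticket"}))
print(resolve_env(
    {"JIRA_SERVER": "http://localhost:8080", "JIRA_TOKEN": "$JIRA_TOKEN"},
    caller_env={"JIRA_TOKEN": "abc"},
))
```

Passing `caller_env` explicitly keeps the example independent of the machine it runs on; omitting it falls back to `os.environ`.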

npx claudepluginhub opendatahub-io/agent-eval-harness --plugin agent-eval-harness