agent-eval-harness

Name: agent-eval-harness
Author: opendatahub-io

Run automated evaluations of Claude Code skills against test cases, scoring outputs with judges and tracking results in MLflow for experiment management. Includes a self-improvement loop that diagnoses failures, edits skill definitions, and re-runs evaluations to detect regressions.

npx claudepluginhub opendatahub-io/agent-eval-harness --plugin agent-eval-harness

Popularity

Stars: Top 25% (Med: 0 · Avg: 281)

Installs: Top 5% (Med: 0 · Avg: 1)

What's Inside

Skills (8)

eval-analyze

/eval-analyze

Analyze a skill and generate eval.yaml for the agent eval harness. Deeply examines the skill's SKILL.md, sub-skills, scripts, and test cases to produce the full evaluation config — execution mode, dataset schema, output descriptions, judges, models, and thresholds. Use this skill whenever someone wants to set up evaluation, test a skill, add quality checks, benchmark a skill, or just created a new skill and needs eval infrastructure. Also triggered automatically by /eval-run when eval.yaml is missing. Even if the user just says "how do I know if my skill is working?" — this is the right starting point.

eval-check

/eval-check

Evaluate the full harness configuration as a system. Scans all skills, commands, CLAUDE.md, and hooks for redundancy, overlap, type misclassification, and structural issues. Produces an informational report with restructuring suggestions. Use when the user wants to check their overall setup health, find redundant skills, detect overlapping triggers, or get restructuring recommendations before diving into individual skill evaluation. Triggers on "check my setup", "harness health", "are my skills redundant", "what should I merge", "setup overview", "configuration check".

eval-dataset

/eval-dataset

Generate evaluation test cases for a skill. Creates realistic test inputs based on skill analysis, bootstraps a starter dataset, or expands an existing one to improve coverage. Use when setting up evaluation for the first time, when the user needs test cases, when coverage is too thin, or after /eval-analyze when no dataset exists yet. Triggers on "create test cases", "generate test data", "need test inputs", "make a dataset", "add more cases", "improve coverage". Also useful when /eval-run reports "no test cases found."

eval-mlflow

/eval-mlflow

MLflow integration for evaluation — sync datasets, log run results, push/pull feedback between the harness and MLflow traces. Use when the user wants to log eval results to MLflow, sync test cases to MLflow datasets, connect judge scores to traces, pull MLflow annotations for eval-optimize, or view results in the MLflow UI. Triggers on "log to mlflow", "sync dataset", "push results", "mlflow integration", "view in mlflow".

eval-optimize

/eval-optimize

Automated skill improvement loop. Runs eval, identifies judge failures, reads traces and rationale, edits the SKILL.md to fix issues, re-runs to verify, and checks for regressions. Use when the user wants to automatically improve a skill based on eval results, fix failing judges, make the skill better, auto-fix quality issues, improve scores, or iterate until all judges pass. Triggers on "optimize the skill", "make it pass", "auto-fix", "improve the scores", "why is it failing". Works best after /eval-run has produced results to learn from.
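
The loop described for /eval-optimize can be sketched roughly as below. This is a minimal illustration, not the plugin's implementation: `optimize`, `run_eval`, `propose_fix`, and `max_rounds` are hypothetical names, and the stubs at the bottom stand in for a real eval run and a real SKILL.md edit.

```python
from typing import Callable, Dict, List

Scores = Dict[str, bool]  # judge name -> did the judge pass?


def optimize(skill_text: str,
             run_eval: Callable[[str], Scores],
             propose_fix: Callable[[str, List[str]], str],
             max_rounds: int = 3) -> str:
    """Edit the skill until all judges pass, discarding edits that regress."""
    best_text, best = skill_text, run_eval(skill_text)
    for _ in range(max_rounds):
        failing = [judge for judge, ok in best.items() if not ok]
        if not failing:
            break  # every judge passes, nothing left to fix
        candidate = propose_fix(best_text, failing)
        result = run_eval(candidate)
        # A regression: a judge that passed before the edit fails after it.
        regressed = [j for j, ok in best.items() if ok and not result.get(j, False)]
        if regressed:
            continue  # keep the previous version of the skill
        best_text, best = candidate, result
    return best_text


# Hypothetical stubs standing in for the real eval run and the real edit step.
def run_eval(text: str) -> Scores:
    return {"has_examples": "Example" in text, "short": len(text) < 200}


def propose_fix(text: str, failing: List[str]) -> str:
    return text + "\nExample: ..."


print(optimize("Do X.", run_eval, propose_fix))
```

Keeping the last known-good text and only accepting an edit when no passing judge starts failing is one simple way to express "re-runs to verify, and checks for regressions".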

Hooks (1)

Event Hooks

1 hook across 1 event

Stats

Version: 1.14.2

Released: Jun 24, 2026

Language: Python

Stars: 30

Forks: 29

Copy clicks: 2

Maintenance: Excellent

License: Apache-2.0

Last Commit: Jun 24, 2026

Added: Apr 13, 2026

Available In

agent-eval-harness-dev (30), opendatahub-skills (18)

README

Agent Eval Harness

Generic evaluation framework for agents and skills. Analyze, run, score, and improve skills automatically across different agent harnesses (Claude Code, OpenCode, Agent SDK).

Overview

                                             ┌──────────────────┐
        ┌──────────────setup────────────────▶│  MLflow Server   │◀────────────┐
        │                                    │ (local / remote) │             │
        │                                    └──┬───────────────┘          sync, log
        │                                    datasets                      feedback
        │                                       │                             │
┌───────┴──────┐  ┌───────────────┐  ┌──────────▼───┐  ┌──────────────┐  ┌────┴───────────┐
│  eval-setup  │─▶│ eval-analyze  │─▶│ eval-dataset │─▶│   eval-run   │─▶│  eval-mlflow   │
│              │  │               │  │              │  │              │  │                │
│ dependencies │  │ analyze skill │  │ generate     │  │ execute eval │  │ sync dataset   │
│ MLflow conf  │  │ gen eval.yaml │  │ test cases   │  │ collect      │  │ log results    │
│ directories  │  │ suggest judges│  │ fill gaps    │  │ score        │  │ traces         │
└──────────────┘  └───────────────┘  └──────────────┘  └──▲──┬─▲──┬───┘  └────────────────┘
                                                          │  │ │  │
                                            ┌─────────────┘  │ │  └────────────┐
                                            │         ┌──────▼─┴─────┐         │
                                            │         │ eval-review  │         │
                                            │         │              │         │
                                            │         │ human review │         │
                                            │         │ feedback     │         │
                                            │         └──────────────┘         │
                                            │                                  │
                                            │        ┌───────────────┐         │
                                            └────────│ eval-optimize │◀────────┘
                                                     │               │
                                                     │ fix skill     │
                                                     │ re-run        │
                                                     └───────────────┘

Quick Start

1. Add to your project

Install from the skills registry:

claude plugin install agent-eval-harness@opendatahub-skills

Or clone and load as a local plugin:

git clone https://github.com/opendatahub-io/agent-eval-harness
pip install -e ./agent-eval-harness
claude --plugin-dir ./agent-eval-harness

This makes all eval skills available: /eval-setup, /eval-analyze, /eval-dataset, /eval-run, /eval-review, /eval-mlflow, /eval-optimize, and /eval-check.

2. Set up environment

/eval-setup

This checks dependencies, configures MLflow, verifies API keys, and creates directories.

3. Analyze your skill

/eval-analyze --skill my-skill

This examines the skill's SKILL.md, discovers test cases, and generates eval.yaml with:

Natural language schema descriptions of your dataset and outputs
Suggested judges (inline checks + LLM quality assessment)
Regression thresholds
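
As a rough illustration of what a regression threshold can express, the sketch below flags any judge whose score dropped by more than an allowed amount between two runs. The function name `find_regressions`, the `max_drop` parameter, and the score format are assumptions made for this example; the README excerpt does not show how the harness defines or stores thresholds.

```python
from typing import Dict, Tuple


def find_regressions(baseline: Dict[str, float],
                     current: Dict[str, float],
                     max_drop: float = 0.05) -> Dict[str, Tuple[float, float]]:
    """Return judges whose score fell by more than max_drop versus baseline."""
    regressions = {}
    for judge, old in baseline.items():
        new = current.get(judge)
        # Judges missing from the current run are skipped in this sketch.
        if new is not None and old - new > max_drop:
            regressions[judge] = (old, new)
    return regressions


# Hypothetical per-judge pass rates from two runs.
baseline = {"format_check": 1.0, "quality": 0.80}
current = {"format_check": 1.0, "quality": 0.70}
print(find_regressions(baseline, current))  # {'quality': (0.8, 0.7)}
```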

4. Generate test cases (if needed)

/eval-dataset

Creates 5 starter test cases based on the skill analysis. Skip this if you already have cases.

5. Run evaluation

/eval-run --model opus

This prepares a workspace, runs the skill (headless or interactive), collects artifacts, scores with judges, and reports results.
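
The "score with judges" step can be pictured with inline-check judges modeled as plain functions over each case's output. This is a sketch under assumptions: `score_outputs`, the judge names, and the case IDs are invented for illustration, and the harness's real judge interface (including LLM-based judges) is not shown in this excerpt.

```python
from typing import Callable, Dict

Judge = Callable[[str], bool]  # takes a case's output text, returns pass/fail


def score_outputs(outputs: Dict[str, str], judges: Dict[str, Judge]) -> Dict[str, float]:
    """Return each judge's pass rate over all case outputs (outputs must be non-empty)."""
    report = {}
    for name, judge in judges.items():
        passed = sum(1 for text in outputs.values() if judge(text))
        report[name] = passed / len(outputs)
    return report


# Hypothetical collected outputs, keyed by case ID.
outputs = {
    "case-001": "## Summary\nAll good.",
    "case-002": "no heading here",
}
judges = {
    "has_heading": lambda text: text.startswith("## "),
    "non_empty": lambda text: bool(text.strip()),
}
print(score_outputs(outputs, judges))  # {'has_heading': 0.5, 'non_empty': 1.0}
```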

eval.yaml

The harness uses natural language to describe evaluation datasets and skill inputs and outputs, and spawns LLM sub-agents to interpret those descriptions.

name: my-skill-eval
description: Evaluate the main skill pipeline
skill: my-skill-name

# Execution — how the skill processes test cases (runner-agnostic)
execution:
  mode: case              # case (default) or batch
  arguments: "{prompt}"   # resolved per case from input.yaml fields
  # timeout: 3600            # Wall-clock timeout in seconds per invocation
  # max_budget_usd: 5.0      # Cost cap in USD per invocation
  # parallelism: 3            # Run up to N cases concurrently (case mode only)
  # env:                     # Inject env vars into workspace settings
  #   JIRA_SERVER: http://localhost:8080   # Literal value
  #   JIRA_TOKEN: $JIRA_TOKEN              # $VAR resolved from caller's env
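
The comments on `arguments` and `env` describe two substitutions: placeholders filled per case from input fields, and `$VAR` values taken from the caller's environment. The sketch below shows one plausible reading of those comments. The helper names `resolve_arguments` and `resolve_env` are hypothetical, and the choice to raise on an unset variable is an assumption, not documented harness behavior.

```python
from typing import Dict, Mapping


def resolve_arguments(template: str, case_fields: Mapping[str, str]) -> str:
    """Fill "{field}" placeholders from one test case's input fields."""
    return template.format_map(case_fields)


def resolve_env(env_spec: Mapping[str, str], caller_env: Mapping[str, str]) -> Dict[str, str]:
    """Copy literal values; replace "$VAR" values with the caller's VAR."""
    resolved = {}
    for name, value in env_spec.items():
        if value.startswith("$"):
            # Assumption: an unset variable raises KeyError rather than becoming "".
            resolved[name] = caller_env[value[1:]]
        else:
            resolved[name] = value
    return resolved


# Mirrors the execution block above; the case and caller env are made up.
execution = {
    "arguments": "{prompt}",
    "env": {"JIRA_SERVER": "http://localhost:8080", "JIRA_TOKEN": "$JIRA_TOKEN"},
}
case = {"prompt": "Summarize ticket ABC-1"}
print(resolve_arguments(execution["arguments"], case))
print(resolve_env(execution["env"], {"JIRA_TOKEN": "secret"}))
```

In real use the second argument to `resolve_env` would typically be `os.environ`; a plain dict is passed here so the example does not depend on the machine it runs on.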
