Skill

output-eval-audit

Audits an eval suite for trustworthiness by checking error analysis grounding, evaluator design, judge validation, and dataset coverage. Use when inheriting evals, suspecting missed failures, or after pipeline changes.

testing

Popularity

Parent stars

422

Parent forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/outputai:output-eval-audit

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill is limited to the following tools:

BashRead

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Audit your eval suite to determine whether it actually catches real failures. This skill provides a structured diagnostic that identifies gaps in error analysis, evaluator design, judge validation, and dataset coverage, with concrete remediation steps for each finding.

SKILL.md

240 lines · ~2.5k tokens

Stats

LanguageJavaScript

Parent stars422

Parent forks12

MaintenanceExcellent

Last CommitJun 23, 2026

Actions

View Source View Plugin View on GitHub View README

Auditing an Eval Suite

Overview

When to Use

Inheriting an eval suite from another team or developer
Suspecting that evals pass but production quality is poor
After switching models, rewriting prompts, or changing pipeline logic
Periodic health check (quarterly or after major releases)

Step 1: Gather Artifacts

Read the eval infrastructure files for the workflow being audited:

src/workflows/<workflow_name>/
├── tests/
│   ├── datasets/           # YAML dataset files
│   │   ├── *.yml
│   │   └── ...
│   └── evals/
│       ├── evaluators.ts   # Evaluator definitions
│       ├── workflow.ts      # Eval workflow definition
│       └── *.prompt         # Judge prompt files

Inventory what exists:

Artifact	File(s)	Count
Evaluators	`tests/evals/evaluators.ts`	?
Eval workflow	`tests/evals/workflow.ts`	? entries in `evals` array
Judge prompts	`tests/evals/*.prompt`	?
Datasets	`tests/datasets/*.yml`	?
Datasets with ground_truth	? of above	?
Datasets with last_output	? of above	?

If any of these are missing entirely, note it and skip to "Starting From Zero" at the bottom.

Step 2: Run the Diagnostic

Evaluate each of the four areas below. For each, assign a status:

Pass — Meets the standard
Warn — Partially meets the standard, improvements needed
Fail — Does not meet the standard, significant risk

Area 1: Error Analysis Grounding

Question: Were the evaluators derived from observed failure modes in real workflow traces?

Check:

Do failure categories exist (documented in a file, comments, or commit history)?
Does each evaluator map to a specific failure category?
Or are evaluators measuring generic qualities ("quality score", "overall rating")?

Pass criteria:

Each evaluator targets a named failure mode (e.g., "check_tone" targets tone mismatch, not "evaluate general quality")
Failure categories were derived from reviewing real traces (not brainstormed)

Common failures:

Evaluators named evaluate_quality, check_overall, rate_output — generic, not grounded in observed failures
Evaluators were written based on what seemed important, not what actually fails
No evidence of trace review before evaluator creation

Remediation: output-eval-error-analysis — Review 50+ traces and categorize actual failure modes before modifying evaluators

Area 2: Evaluator Design

Question: Are the evaluators well-designed for reliable automated evaluation?

Check each evaluator in tests/evals/evaluators.ts:

Check	What to look for
One failure mode per judge	Each `judgeVerdict()` evaluator targets exactly one criterion
Binary verdicts	Judge prompts use pass/fail, not Likert scales (1-5) or multi-axis ratings
Code-based where possible	Objective checks use `Verdict.*` helpers, not LLM judges
Few-shot examples in judges	Judge `.prompt` files include pass, fail, and borderline examples
Critique before verdict	Judge prompts request critique/reasoning before the verdict in structured output
Appropriate criticality	`required` for blocking failures, `informational` for nice-to-have checks
Correct interpret type	`interpret` config matches what the evaluator returns

Pass criteria:

All checks above are met for every evaluator

Common failures:

A single judge prompt evaluates 3+ criteria simultaneously ("Rate tone, accuracy, and completeness")
Judge prompts have no few-shot examples
Deterministic checks (length, string contains, regex) use LLM judges instead of Verdict.*
interpret type doesn't match evaluator return type (e.g., judgeVerdict() with interpret: { type: 'boolean' })

Remediation: output-eval-judge-prompt — Redesign judge prompts following the four-component structure

Area 3: Judge Validation

Question: Have LLM judges been validated against human labels?

Check for each LLM-based evaluator (those using judgeVerdict(), judgeScore(), judgeLabel()):

Check	What to look for
Human labels exist	Datasets have `ground_truth.evals.<evaluator_name>.verdict` populated
TPR/TNR measured	Validation results documented (file, comment, or commit)
Train/dev/test split	Few-shot examples in the judge prompt come from a designated train split, not from the same data used for measurement
Metrics meet threshold	TPR > 80% and TNR > 80% (target: > 90%)

Pass criteria:

Every LLM judge has documented TPR/TNR metrics above 80%
Train/dev/test split was used (no data leakage)

Common failures:

No validation at all — judges were written and immediately deployed
Few-shot examples in the judge prompt are the same examples used to measure metrics (data leakage)
"It seems to work" without quantitative measurement
Only raw accuracy reported (masks class imbalance)

Remediation: output-eval-validate-judge — Calibrate each judge against human labels using TPR/TNR

Area 4: Dataset Coverage

Question: Do the datasets adequately cover the failure space?

Check:

Check	What to look for
Dataset count	Minimum 10 for simple workflows, 20+ for complex ones
Diversity	Datasets vary across multiple input dimensions, not just happy paths
Failure representation	At least 30% of datasets have `human_verdict: fail` in ground_truth
Ground truth populated	Most datasets have `ground_truth` with per-evaluator labels
Real + synthetic mix	Includes production traces alongside synthetic test cases
No near-duplicates	Each dataset tests a meaningfully different scenario

Pass criteria:

20+ diverse datasets with ground truth
Both pass and fail cases represented (not 95% passes)
Datasets cover different input dimensions

Common failures:

Only 3-5 datasets, all happy-path variations
100% of datasets pass (no failure cases to validate judges against)
Datasets are synthetic-only with no real production traces
Ground truth fields are empty or missing

Remediation: output-eval-dataset-design — Design diverse datasets using dimension-based variation

Step 3: Compile the Report

Summarize findings in a structured format:

# Eval Audit: <workflow_name>
# Date: YYYY-MM-DD
# Auditor: <name>

## Summary

| Area | Status | Key Finding |
|------|--------|-------------|
| Error Analysis Grounding | Warn | Evaluators seem reasonable but no documented trace review |
| Evaluator Design | Fail | Single judge evaluates 3 criteria simultaneously |
| Judge Validation | Fail | No validation performed on any LLM judge |
| Dataset Coverage | Warn | 12 datasets but only 2 are failure cases |

## Findings

### 1. Error Analysis Grounding — WARN
Evaluators target reasonable criteria (tone, topic, length) but there is no evidence
that these were derived from observed failures. The eval suite may be missing the
workflow's actual top failure modes.

**Next step:** Run error analysis on 50+ production traces (`output-eval-error-analysis`)

### 2. Evaluator Design — FAIL
`evaluate_overall_quality` in evaluators.ts uses a single judgeVerdict() call that
assesses tone, accuracy, and completeness simultaneously. This makes failures
unactionable — when it fails, you don't know which criterion failed.

**Next step:** Split into three focused judges (`output-eval-judge-prompt`)

### 3. Judge Validation — FAIL
No TPR/TNR metrics exist for any LLM judge. The [email protected] has no
few-shot examples.

**Next step:** Label 100 datasets, validate each judge (`output-eval-validate-judge`)

### 4. Dataset Coverage — WARN
12 datasets exist with cached output. Only 2 have ground_truth.human_verdict: fail.
All inputs are simple topics with no edge cases.

**Next step:** Design 20+ diverse datasets (`output-eval-dataset-design`)

## Priority Order
1. Error analysis (foundational — may change which evaluators are needed)
2. Split holistic judge into focused judges
3. Expand datasets to 30+ with balanced pass/fail
4. Validate all LLM judges

Starting From Zero

If the workflow has no eval infrastructure at all:

Start with error analysis — output-eval-error-analysis. Review 50+ workflow traces.
Build datasets — output-eval-dataset-design. Create 20+ diverse datasets.
Implement evaluators — output-dev-eval-testing. Write verify() evaluators and evalWorkflow().
Design judge prompts — output-eval-judge-prompt. For subjective criteria only.
Validate judges — output-eval-validate-judge. Before trusting any LLM judge.

Do not skip error analysis. Building evaluators without understanding how the workflow fails wastes effort on the wrong things.

Related Skills

output-eval-error-analysis — Systematic trace review and failure categorization
output-eval-judge-prompt — Design effective LLM judge prompts
output-eval-dataset-design — Generate diverse test datasets
output-eval-validate-judge — Calibrate LLM judges against human labels
output-dev-eval-testing — Implementation reference for offline eval testing
output-dev-evaluator-function — Implementation reference for runtime evaluators

output-eval-audit

Popularity

Invocation

Tool Access

Context Preview

SKILL.md

output-eval-audit

Popularity

Invocation

Tool Access

Context Preview

SKILL.md

Auditing an Eval Suite

Overview

When to Use

Step 1: Gather Artifacts

Step 2: Run the Diagnostic

Area 1: Error Analysis Grounding

Area 2: Evaluator Design

Area 3: Judge Validation

Area 4: Dataset Coverage

Step 3: Compile the Report

Starting From Zero

Related Skills

Similar Skills

Auditing an Eval Suite

Overview

When to Use

Step 1: Gather Artifacts

Step 2: Run the Diagnostic

Area 1: Error Analysis Grounding

Area 2: Evaluator Design

Area 3: Judge Validation

Area 4: Dataset Coverage

Step 3: Compile the Report

Starting From Zero

Related Skills

Similar Skills