From outputai
Audits an eval suite for trustworthiness by checking error analysis grounding, evaluator design, judge validation, and dataset coverage. Use when inheriting evals, suspecting missed failures, or after pipeline changes.
Slash command: /outputai:output-eval-audit
Audit your eval suite to determine whether it actually catches real failures. This skill provides a structured diagnostic that identifies gaps in error analysis, evaluator design, judge validation, and dataset coverage, with concrete remediation steps for each finding.
Read the eval infrastructure files for the workflow being audited:
src/workflows/<workflow_name>/
├── tests/
│ ├── datasets/ # YAML dataset files
│ │ ├── *.yml
│ │ └── ...
│ └── evals/
│ ├── evaluators.ts # Evaluator definitions
│ ├── workflow.ts # Eval workflow definition
│ └── *.prompt # Judge prompt files
Inventory what exists:
| Artifact | File(s) | Count |
|---|---|---|
| Evaluators | tests/evals/evaluators.ts | ? |
| Eval workflow | tests/evals/workflow.ts | ? entries in evals array |
| Judge prompts | tests/evals/*.prompt | ? |
| Datasets | tests/datasets/*.yml | ? |
| Datasets with ground_truth | ? of above | ? |
| Datasets with last_output | ? of above | ? |
If any of these are missing entirely, note it and skip to "Starting From Zero" at the bottom.
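The dataset rows of the inventory can be tallied with a short script. The sketch below is one possible approach, not part of the skill itself: the `DatasetFile` shape and both function names are hypothetical, and it assumes `ground_truth` and `last_output` appear as top-level (unindented) YAML keys. A real YAML parser would be more robust than this regex check.

```typescript
// Hypothetical shape: a dataset file name plus its raw YAML text,
// loaded however you prefer (fs, glob, etc.).
interface DatasetFile {
  name: string;
  content: string;
}

interface DatasetInventory {
  total: number;
  withGroundTruth: number;
  withLastOutput: number;
}

// Assumption: the key sits at the start of a line with no indentation.
function hasTopLevelKey(yamlText: string, key: string): boolean {
  return new RegExp("^" + key + "\\s*:", "m").test(yamlText);
}

// Fills in the three dataset rows of the inventory table.
function inventoryDatasets(files: DatasetFile[]): DatasetInventory {
  return {
    total: files.length,
    withGroundTruth: files.filter((f) => hasTopLevelKey(f.content, "ground_truth")).length,
    withLastOutput: files.filter((f) => hasTopLevelKey(f.content, "last_output")).length,
  };
}
```

Indented occurrences of the same key names are ignored on purpose, so a nested field called `ground_truth` inside the input does not inflate the count.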
Evaluate each of the four areas below. For each, assign a status: Pass, Warn, or Fail.
1. Error Analysis Grounding
Question: Were the evaluators derived from observed failure modes in real workflow traces?
Check:
Pass criteria:
Common failures:
evaluate_quality, check_overall, rate_output: generic, not grounded in observed failures
Remediation: output-eval-error-analysis. Review 50+ traces and categorize actual failure modes before modifying evaluators.
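Generic evaluator names are a cheap first signal that an evaluator was not derived from an observed failure mode. The heuristic below is a rough sketch: the token list is an assumption extrapolated from the example names in this section, and `isGenericEvaluatorName` is a hypothetical helper, not part of any framework. A flagged name is a prompt to go look at the evaluator, not proof of a problem.

```typescript
// Assumed list of tokens that carry no information about a specific failure mode.
const GENERIC_TOKENS: string[] = [
  "evaluate", "check", "rate", "score", "assess",
  "quality", "overall", "output", "response", "result",
];

// A name is treated as generic when every snake_case token is in the list above.
function isGenericEvaluatorName(name: string): boolean {
  const tokens = name
    .toLowerCase()
    .split(/[_\-\s]+/)
    .filter((t) => t.length > 0);
  return tokens.length > 0 && tokens.every((t) => GENERIC_TOKENS.indexOf(t) !== -1);
}
```

A name such as `check_citation_format` passes because `citation` and `format` point at a concrete failure, while `evaluate_quality` is flagged.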
2. Evaluator Design
Question: Are the evaluators well-designed for reliable automated evaluation?
Check each evaluator in tests/evals/evaluators.ts:
| Check | What to look for |
|---|---|
| One failure mode per judge | Each judgeVerdict() evaluator targets exactly one criterion |
| Binary verdicts | Judge prompts use pass/fail, not Likert scales (1-5) or multi-axis ratings |
| Code-based where possible | Objective checks use Verdict.* helpers, not LLM judges |
| Few-shot examples in judges | Judge .prompt files include pass, fail, and borderline examples |
| Critique before verdict | Judge prompts request critique/reasoning before the verdict in structured output |
| Appropriate criticality | required for blocking failures, informational for nice-to-have checks |
| Correct interpret type | interpret config matches what the evaluator returns |
Pass criteria:
Common failures:
LLM judges used for objective checks that could use Verdict.* helpers
interpret type doesn't match evaluator return type (e.g., judgeVerdict() with interpret: { type: 'boolean' })
Remediation: output-eval-judge-prompt. Redesign judge prompts following the four-component structure.
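Two of the checks in the table above (binary verdicts and few-shot examples) can be pre-screened with a text scan of each judge prompt before reading it by hand. This is a heuristic sketch only: `lintJudgePrompt` and `PromptFinding` are hypothetical names, the regular expressions are assumptions about how Likert scales and examples are usually worded, and the remaining checks still need a manual read.

```typescript
interface PromptFinding {
  check: string;   // which row of the checklist the finding relates to
  message: string;
}

// Scans raw prompt text for two common design problems.
function lintJudgePrompt(promptText: string): PromptFinding[] {
  const findings: PromptFinding[] = [];
  const text = promptText.toLowerCase();

  // Likert-style wording such as "1-5", "1 to 10", or "scale of".
  const likert = /\b1\s*(?:[-–]|to)\s*(?:5|7|10)\b/.test(text) || /\bscale of\b/.test(text);
  if (likert) {
    findings.push({
      check: "Binary verdicts",
      message: "Prompt appears to ask for a numeric scale instead of pass/fail.",
    });
  }

  // No mention of examples at all suggests a zero-shot judge.
  if (!/\bexamples?\b/.test(text)) {
    findings.push({
      check: "Few-shot examples in judges",
      message: "Prompt does not appear to include any examples.",
    });
  }

  return findings;
}
```

An empty result means only that neither pattern matched; it does not confirm that the examples cover pass, fail, and borderline cases.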
3. Judge Validation
Question: Have LLM judges been validated against human labels?
Check for each LLM-based evaluator (those using judgeVerdict(), judgeScore(), judgeLabel()):
| Check | What to look for |
|---|---|
| Human labels exist | Datasets have ground_truth.evals.<evaluator_name>.verdict populated |
| TPR/TNR measured | Validation results documented (file, comment, or commit) |
| Train/dev/test split | Few-shot examples in the judge prompt come from a designated train split, not from the same data used for measurement |
| Metrics meet threshold | TPR > 80% and TNR > 80% (target: > 90%) |
Pass criteria:
Common failures:
Remediation: output-eval-validate-judge — Calibrate each judge against human labels using TPR/TNR
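The TPR/TNR thresholds above can be checked with a few lines of arithmetic once human labels and judge verdicts sit side by side. The sketch below assumes "pass" is the positive class, so TPR is the share of human-labeled passes the judge also passes and TNR is the share of human-labeled fails the judge also fails. The type and function names are hypothetical. The examples fed in should come from a held-out split, not from the data used for the judge's few-shot examples.

```typescript
type Label = "pass" | "fail";

interface LabeledExample {
  human: Label; // human ground-truth label
  judge: Label; // verdict returned by the LLM judge
}

interface JudgeMetrics {
  tpr: number;           // true positive rate (NaN if no human "pass" examples)
  tnr: number;           // true negative rate (NaN if no human "fail" examples)
  meetsMinimum: boolean; // TPR > 80% and TNR > 80%
  meetsTarget: boolean;  // TPR > 90% and TNR > 90%
}

function measureJudge(examples: LabeledExample[]): JudgeMetrics {
  let tp = 0;
  let fn = 0;
  let tn = 0;
  let fp = 0;
  for (const ex of examples) {
    if (ex.human === "pass") {
      if (ex.judge === "pass") tp++; else fn++;
    } else {
      if (ex.judge === "fail") tn++; else fp++;
    }
  }
  const tpr = tp + fn === 0 ? NaN : tp / (tp + fn);
  const tnr = tn + fp === 0 ? NaN : tn / (tn + fp);
  return {
    tpr,
    tnr,
    meetsMinimum: tpr > 0.8 && tnr > 0.8,
    meetsTarget: tpr > 0.9 && tnr > 0.9,
  };
}
```

Reporting the two rates separately matters because a judge that passes everything scores a perfect TPR while catching no failures at all.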
4. Dataset Coverage
Question: Do the datasets adequately cover the failure space?
Check:
| Check | What to look for |
|---|---|
| Dataset count | Minimum 10 for simple workflows, 20+ for complex ones |
| Diversity | Datasets vary across multiple input dimensions, not just happy paths |
| Failure representation | At least 30% of datasets have human_verdict: fail in ground_truth |
| Ground truth populated | Most datasets have ground_truth with per-evaluator labels |
| Real + synthetic mix | Includes production traces alongside synthetic test cases |
| No near-duplicates | Each dataset tests a meaningfully different scenario |
Pass criteria:
Common failures:
Remediation: output-eval-dataset-design — Design diverse datasets using dimension-based variation
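The countable rows of the coverage table (dataset count, failure share, ground truth, real and synthetic mix) can be checked mechanically. The sketch below is illustrative: `DatasetSummary` is a hypothetical summary you would build from the YAML files, the `source` field is an assumption about how provenance is tracked, and "most datasets" is interpreted here as more than half. Diversity and near-duplicate checks still need human judgment.

```typescript
interface DatasetSummary {
  name: string;
  humanVerdict?: "pass" | "fail";       // from ground_truth, if labeled
  source: "production" | "synthetic";   // hypothetical provenance field
}

// Returns one message per failed check; an empty array means all checks passed.
function checkCoverage(datasets: DatasetSummary[], complexWorkflow: boolean = false): string[] {
  const findings: string[] = [];

  const minCount = complexWorkflow ? 20 : 10;
  if (datasets.length < minCount) {
    findings.push("Only " + datasets.length + " datasets; expected at least " + minCount + ".");
  }

  const labeled = datasets.filter((d) => d.humanVerdict !== undefined);
  const failing = labeled.filter((d) => d.humanVerdict === "fail");
  const failShare = datasets.length === 0 ? 0 : failing.length / datasets.length;
  if (failShare < 0.3) {
    findings.push("Fewer than 30% of datasets are labeled as failures.");
  }

  // Assumption: "most" means strictly more than half.
  if (labeled.length * 2 <= datasets.length) {
    findings.push("Half or fewer of the datasets have ground truth labels.");
  }

  const hasProduction = datasets.some((d) => d.source === "production");
  const hasSynthetic = datasets.some((d) => d.source === "synthetic");
  if (!hasProduction || !hasSynthetic) {
    findings.push("Datasets do not mix production traces with synthetic cases.");
  }

  return findings;
}
```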
Summarize findings in a structured format:
# Eval Audit: <workflow_name>
# Date: YYYY-MM-DD
# Auditor: <name>
## Summary
| Area | Status | Key Finding |
|------|--------|-------------|
| Error Analysis Grounding | Warn | Evaluators seem reasonable but no documented trace review |
| Evaluator Design | Fail | Single judge evaluates 3 criteria simultaneously |
| Judge Validation | Fail | No validation performed on any LLM judge |
| Dataset Coverage | Warn | 12 datasets but only 2 are failure cases |
## Findings
### 1. Error Analysis Grounding — WARN
Evaluators target reasonable criteria (tone, topic, length) but there is no evidence
that these were derived from observed failures. The eval suite may be missing the
workflow's actual top failure modes.
**Next step:** Run error analysis on 50+ production traces (`output-eval-error-analysis`)
### 2. Evaluator Design — FAIL
`evaluate_overall_quality` in evaluators.ts uses a single judgeVerdict() call that
assesses tone, accuracy, and completeness simultaneously. This makes failures
unactionable — when it fails, you don't know which criterion failed.
**Next step:** Split into three focused judges (`output-eval-judge-prompt`)
### 3. Judge Validation — FAIL
No TPR/TNR metrics exist for any LLM judge. The judge .prompt file has no
few-shot examples.
**Next step:** Label 100 datasets, validate each judge (`output-eval-validate-judge`)
### 4. Dataset Coverage — WARN
12 datasets exist with cached output. Only 2 have ground_truth.human_verdict: fail.
All inputs are simple topics with no edge cases.
**Next step:** Design 20+ diverse datasets (`output-eval-dataset-design`)
## Priority Order
1. Error analysis (foundational — may change which evaluators are needed)
2. Split holistic judge into focused judges
3. Expand datasets to 30+ with balanced pass/fail
4. Validate all LLM judges
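If audits are run repeatedly, the Summary table of the report can be generated from structured findings instead of being typed by hand. This is a minimal sketch with hypothetical names (`AreaFinding`, `renderSummaryTable`), and it assumes the three status values are Pass, Warn, and Fail.

```typescript
type AuditStatus = "Pass" | "Warn" | "Fail";

interface AreaFinding {
  area: string;
  status: AuditStatus;
  keyFinding: string;
}

// Renders the Summary table in the same Markdown layout as the report template.
function renderSummaryTable(findings: AreaFinding[]): string {
  const header = [
    "| Area | Status | Key Finding |",
    "|------|--------|-------------|",
  ];
  const rows = findings.map(
    (f) => "| " + f.area + " | " + f.status + " | " + f.keyFinding + " |"
  );
  return header.concat(rows).join("\n");
}
```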
Starting From Zero
If the workflow has no eval infrastructure at all, work through these skills in order:
1. output-eval-error-analysis. Review 50+ workflow traces.
2. output-eval-dataset-design. Create 20+ diverse datasets.
3. output-dev-eval-testing. Write verify() evaluators and evalWorkflow().
4. output-eval-judge-prompt. For subjective criteria only.
5. output-eval-validate-judge. Before trusting any LLM judge.
Do not skip error analysis. Building evaluators without understanding how the workflow fails wastes effort on the wrong things.
output-eval-error-analysis: Systematic trace review and failure categorization
output-eval-judge-prompt: Design effective LLM judge prompts
output-eval-dataset-design: Generate diverse test datasets
output-eval-validate-judge: Calibrate LLM judges against human labels
output-dev-eval-testing: Implementation reference for offline eval testing
output-dev-evaluator-function: Implementation reference for runtime evaluators
npx claudepluginhub growthxai/output --plugin outputai