From outputai
Reviews workflow execution traces to identify failure modes before building evaluators. Use when starting eval projects, after pipeline changes, or when production quality drops.
Slash command: /outputai:output-eval-error-analysis
Review real workflow traces and categorize how your workflow fails **before** writing any evaluators. Evaluators built without error analysis target generic qualities ("is this good?") instead of the specific ways your workflow actually breaks. This skill walks you through the process.
Gather 50-100 representative workflow executions. The more traces you review, the more reliable your failure categories will be.
List recent workflow executions and pull their traces:
# List recent runs for a workflow
npx output workflow runs list <workflowName>
# Pull a specific trace as JSON
npx output workflow debug <workflowId> --json
Download production traces directly into dataset YAML files:
# Download up to 20 recent traces as dataset files
npx output workflow dataset generate <workflowName> --download --limit 20
This creates YAML files in tests/datasets/ with the input and last_output fields populated from real executions.
If production traces are sparse, generate traces from scenario inputs:
# Generate a dataset from a scenario file
npx output workflow dataset generate <workflowName> basic --name basic_trace
# Generate from inline JSON
npx output workflow dataset generate <workflowName> --input '{"topic": "AI safety"}' --name ai_safety_trace
Run enough inputs to get 50+ traces. Prioritize diversity over volume — vary inputs across the dimensions you expect to matter.
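One way to vary inputs systematically is to enumerate combinations of the dimensions you care about and emit one inline-JSON command per combination. The sketch below is illustrative only: the dimension names (topic, tone, min_length) are borrowed from the example dataset later in this document, `blog_generator` is a placeholder workflow name, and `expandInputs` and `buildCommands` are hypothetical helpers, not part of the CLI.

```typescript
// Hypothetical input dimensions; replace with the ones that matter for your workflow.
type Dimensions = Record<string, Array<string | number>>;
type ScenarioInput = Record<string, string | number>;

// Cartesian product of all dimension values: one input object per combination.
function expandInputs(dimensions: Dimensions): ScenarioInput[] {
  let combos: ScenarioInput[] = [{}];
  for (const key of Object.keys(dimensions)) {
    const next: ScenarioInput[] = [];
    for (const combo of combos) {
      for (const value of dimensions[key]) {
        next.push({ ...combo, [key]: value });
      }
    }
    combos = next;
  }
  return combos;
}

// Render one `dataset generate` command per input, using the --input and --name
// flags shown above. Assumes no value contains a single quote (shell quoting).
function buildCommands(workflowName: string, scenarioInputs: ScenarioInput[]): string[] {
  return scenarioInputs.map(
    (input, i) =>
      `npx output workflow dataset generate ${workflowName} --input '${JSON.stringify(input)}' --name trace_${i + 1}`
  );
}

const scenarioInputs = expandInputs({
  topic: ["AI safety", "database indexing"],
  tone: ["formal", "casual"],
  min_length: [200, 500],
});
const generateCommands = buildCommands("blog_generator", scenarioInputs);
// 2 topics x 2 tones x 2 lengths = 8 commands
```

A full grid grows quickly, so in practice you would likely sample from it rather than run every combination.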
Review each trace one at a time. For each trace, record:
| Field | What to write |
|---|---|
| Trace ID | The workflow execution ID |
| Verdict | Pass or Fail (binary — no "partial" at this stage) |
| Root cause | If Fail: what specifically went wrong and why |
| Notes | Anything surprising or worth remembering |
Create a file to track your reviews. A simple markdown table works:
# Error Analysis: <workflow_name>
# Date: YYYY-MM-DD
# Traces reviewed: 0 / 50
| # | Trace ID | Verdict | Root Cause | Notes |
|---|----------|---------|------------|-------|
| 1 | abc-123 | Fail | Hallucinated a URL that doesn't exist | Common with technical topics |
| 2 | def-456 | Pass | — | Clean output |
| 3 | ghi-789 | Fail | Ignored the "formal tone" requirement | Input had conflicting signals |
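If you would rather keep reviews as structured data and render the table from it, something like the following could work. This is a sketch under assumptions: `TraceReview`, `renderLog`, and `missingRootCauses` are hypothetical names chosen here, and the record simply mirrors the four fields listed above.

```typescript
// Hypothetical record shape mirroring the review fields described above.
interface TraceReview {
  traceId: string;
  verdict: "Pass" | "Fail"; // binary on purpose: no "partial" at this stage
  rootCause?: string; // expected whenever verdict is "Fail"
  notes?: string;
}

// Render the review log as a markdown table in the same layout as the example.
function renderLog(workflowName: string, target: number, reviews: TraceReview[]): string {
  const header = [
    `# Error Analysis: ${workflowName}`,
    `# Traces reviewed: ${reviews.length} / ${target}`,
    "| # | Trace ID | Verdict | Root Cause | Notes |",
    "|---|----------|---------|------------|-------|",
  ];
  const rows = reviews.map(
    (r, i) => `| ${i + 1} | ${r.traceId} | ${r.verdict} | ${r.rootCause ?? "-"} | ${r.notes ?? ""} |`
  );
  return header.concat(rows).join("\n");
}

// Flag failed traces that were logged without a root cause.
function missingRootCauses(reviews: TraceReview[]): string[] {
  return reviews
    .filter((r) => r.verdict === "Fail" && (r.rootCause ?? "").trim() === "")
    .map((r) => r.traceId);
}

const reviewLog: TraceReview[] = [
  { traceId: "abc-123", verdict: "Fail", rootCause: "Hallucinated a URL that doesn't exist", notes: "Common with technical topics" },
  { traceId: "def-456", verdict: "Pass", notes: "Clean output" },
  { traceId: "ghi-789", verdict: "Fail" },
];
const reviewTable = renderLog("blog_generator", 50, reviewLog);
const incomplete = missingRootCauses(reviewLog); // contains "ghi-789"
```

Checking for missing root causes before you start categorizing keeps the later grouping step grounded in what was actually observed.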
Open the JSON trace and examine:
Review at least 30 traces before naming any failure categories. Premature categorization causes you to see patterns that aren't there and miss patterns that are. Just record what you observe.
After reviewing 30+ traces, patterns will emerge. Group your failures into 5-10 categories based on root cause, not surface symptoms.
For a blog generation workflow after reviewing 60 traces:
| Category | Count | Rate | Example |
|---|---|---|---|
| Hallucinated URLs | 8 | 13% | Invented links to non-existent pages |
| Tone mismatch | 6 | 10% | Casual tone when formal was requested |
| Off-topic drift | 5 | 8% | Blog about "AI" drifted to unrelated ML history |
| Missing sections | 4 | 7% | Skipped "conclusion" when explicitly requested |
| Too short | 3 | 5% | Under 200 words when 500+ requested |
| Total failures | 26 | 43% | |
| Passes | 34 | 57% | |
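The counts and rates in a table like this are simple arithmetic over your review log: each rate is the category count divided by all reviewed traces (not just failures). A minimal sketch, assuming one category per failed trace and using the hypothetical helper name `tallyFailures`:

```typescript
interface CategoryRow {
  category: string;
  count: number;
  ratePercent: number; // share of ALL reviewed traces, rounded to a whole percent
}

// Count failures per category and express each as a percentage of all reviewed traces.
function tallyFailures(failureCategories: string[], totalReviewed: number): CategoryRow[] {
  const counts: Record<string, number> = {};
  for (const category of failureCategories) {
    counts[category] = (counts[category] ?? 0) + 1;
  }
  return Object.keys(counts)
    .map((category) => ({
      category,
      count: counts[category],
      ratePercent: Math.round((counts[category] / totalReviewed) * 100),
    }))
    .sort((a, b) => b.count - a.count); // highest-rate categories first
}

// Rebuild the blog example: 26 failures across 5 categories out of 60 traces.
const exampleCounts: Array<[string, number]> = [
  ["hallucinated_urls", 8],
  ["tone_mismatch", 6],
  ["off_topic_drift", 5],
  ["missing_sections", 4],
  ["too_short", 3],
];
const observedFailures: string[] = [];
for (const [category, n] of exampleCounts) {
  for (let i = 0; i < n; i++) observedFailures.push(category);
}
const taxonomy = tallyFailures(observedFailures, 60);
// taxonomy[0] is { category: "hallucinated_urls", count: 8, ratePercent: 13 }
```

If a single trace can carry several failure categories, the per-category rates will no longer sum to the overall failure rate, which is worth noting next to the table.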
Add ground_truth labels to your dataset YAML files so evaluators can validate against them. Each failure category maps to a future evaluator name.
name: ai_safety_trace
input:
  topic: "AI safety"
  tone: "formal"
  min_length: 500
last_output:
  output:
    title: "Understanding AI Safety"
    blog_post: "AI safety is super important and stuff..."
  executionTimeMs: 3200
  date: '2026-03-25T00:00:00.000Z'
ground_truth:
  # Global ground truth (available to all evaluators)
  human_verdict: fail
  failure_categories:
    - tone_mismatch
  notes: "Used casual language despite formal tone request"
  # Per-evaluator ground truth
  evals:
    check_tone:
      expected_tone: formal
      verdict: fail
    check_length:
      min_length: 500
      verdict: pass
    check_hallucinated_urls:
      verdict: pass
The ground_truth.evals.<evaluator_name> fields map directly to the evaluator names you'll use in verify(). Each evaluator receives its own ground truth merged with the top-level ground truth via context.ground_truth.
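To picture what an evaluator might see after that merge, here is a rough sketch. The exact merge rules are not spelled out here, so this assumes that per-evaluator fields override top-level fields on key collisions and that the `evals` map itself is left out of the merged result. `mergeGroundTruth` is a hypothetical helper for illustration, not the framework's implementation.

```typescript
type Fields = Record<string, unknown>;

// ASSUMPTION: per-evaluator keys win over top-level keys, and `evals` is dropped.
function mergeGroundTruth(groundTruth: Fields, evaluatorName: string): Fields {
  const { evals, ...globalFields } = groundTruth;
  const perEvaluator = (evals as Record<string, Fields> | undefined)?.[evaluatorName] ?? {};
  return { ...globalFields, ...perEvaluator };
}

// Same values as the dataset example above, written as a plain object.
const exampleGroundTruth: Fields = {
  human_verdict: "fail",
  failure_categories: ["tone_mismatch"],
  notes: "Used casual language despite formal tone request",
  evals: {
    check_tone: { expected_tone: "formal", verdict: "fail" },
    check_length: { min_length: 500, verdict: "pass" },
    check_hallucinated_urls: { verdict: "pass" },
  },
};

const toneContext = mergeGroundTruth(exampleGroundTruth, "check_tone");
// toneContext carries human_verdict, failure_categories, notes, expected_tone, and verdict
```

Under these assumptions, an evaluator with no entry under `evals` would still receive the top-level fields, which is why labeling `human_verdict` on every dataset is useful even before the per-evaluator labels exist.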
You don't need to label every dataset for every category. Focus on:
- human_verdict (pass/fail)

Not every failure category needs an evaluator. Use this decision tree:
Is this failure caused by a fixable prompt/tool gap?
├─ YES → Fix the prompt or add the missing tool first
│        Re-run error analysis after the fix
└─ NO  → Will this failure recur and need ongoing monitoring?
         ├─ YES → Build an evaluator
         │        Can it be checked with deterministic code?
         │        ├─ YES → Use Verdict.* helpers (contains, matches, gte, etc.)
         │        └─ NO  → Use judgeVerdict() with an LLM judge prompt
         └─ NO  → Document it and move on (rare edge case)
Build evaluators for the highest-rate failure categories first. A failure at 13% matters more than one at 2%.
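The decision tree can also be written down as a small function, which is handy if you want to record the chosen next step alongside each category. This is only a sketch: the type and function names are hypothetical, and the three yes/no answers still come from your own judgment.

```typescript
type NextStep =
  | "fix_prompt_or_tool_then_rerun_analysis"
  | "build_code_evaluator"
  | "build_llm_judge_evaluator"
  | "document_and_move_on";

// The three questions from the decision tree, answered per failure category.
interface FailureAssessment {
  fixableByPromptOrTool: boolean;
  recursAndNeedsMonitoring: boolean;
  checkableWithCode: boolean;
}

function decideNextStep(a: FailureAssessment): NextStep {
  // A fixable gap is fixed first; evaluators come after re-running the analysis.
  if (a.fixableByPromptOrTool) return "fix_prompt_or_tool_then_rerun_analysis";
  // Rare edge cases that will not recur are documented, not monitored.
  if (!a.recursAndNeedsMonitoring) return "document_and_move_on";
  // Prefer deterministic code; fall back to an LLM judge only when needed.
  return a.checkableWithCode ? "build_code_evaluator" : "build_llm_judge_evaluator";
}

const toneMismatchStep = decideNextStep({
  fixableByPromptOrTool: false,
  recursAndNeedsMonitoring: true,
  checkableWithCode: false,
}); // "build_llm_judge_evaluator"
```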
Many failures that seem subjective have objective proxies:
| Failure | Seems like... | But you can check with... |
|---|---|---|
| "Too short" | Subjective | Verdict.gte(output.length, threshold) |
| "Missing section" | Needs LLM | Verdict.contains(output, "## Conclusion") |
| "Hallucinated URLs" | Needs LLM | Extract URLs with regex, verify with HTTP HEAD |
| "Wrong format" | Needs LLM | Verdict.matches(output, expectedPattern) |
Reserve LLM judges for genuinely subjective criteria: tone, relevance, faithfulness, coherence.
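To make the objective proxies concrete, here are plain TypeScript stand-ins for three of them. They deliberately do not call the Verdict.* helpers, whose exact signatures are not shown here; the function names and the URL pattern are assumptions for illustration, and the HTTP HEAD verification step is left out because it needs network access.

```typescript
// Proxy for "Too short": count whitespace-separated words.
function wordCount(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Proxy for "Missing section": look for a markdown heading with the given title.
function hasSection(markdown: string, title: string): boolean {
  return markdown.split("\n").some((line) => {
    const m = /^#{1,6}\s+(.*)$/.exec(line.trim());
    return m !== null && m[1].trim().toLowerCase() === title.toLowerCase();
  });
}

// First half of the "Hallucinated URLs" proxy: pull out candidate URLs.
// Each result could then be verified with an HTTP HEAD request.
function extractUrls(text: string): string[] {
  const matches: string[] = text.match(/https?:\/\/[^\s)\]>"']+/g) ?? [];
  // Strip trailing sentence punctuation that the pattern may have swallowed.
  return matches.map((url) => url.replace(/[.,;:!?]+$/, ""));
}

const samplePost =
  "# Title\n\nSee https://example.com/docs, and (http://example.org/a).\n\n## Conclusion\nDone.";

const sampleUrls = extractUrls(samplePost); // ["https://example.com/docs", "http://example.org/a"]
const sampleHasConclusion = hasSection(samplePost, "Conclusion"); // true
const sampleTooShort = wordCount(samplePost) < 500; // true
```

Matching on a parsed heading rather than a raw substring avoids false passes when the word "Conclusion" merely appears in the body text.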
Create a mapping document that connects your failure categories to planned evaluators:
# Evaluator Plan: blog_generator
| Category | Rate | Evaluator Type | Evaluator Name | Criticality |
|----------|------|----------------|----------------|-------------|
| Hallucinated URLs | 13% | Code (URL extraction + HTTP check) | check_urls | required |
| Tone mismatch | 10% | LLM judge | check_tone | required |
| Off-topic drift | 8% | LLM judge | check_topic | required |
| Missing sections | 7% | Code (string contains) | check_sections | required |
| Too short | 5% | Code (length check) | check_length | informational |
This becomes your implementation roadmap. Use criticality: 'required' for failure categories that should block a passing verdict. Use 'informational' for nice-to-have checks.
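The difference between the two criticality levels can be illustrated with a small aggregation sketch. How the framework actually combines evaluator results is not described here, so treat this as an assumption-laden model of the intent: required failures block the overall verdict, informational failures are only reported. All type and function names are hypothetical.

```typescript
type Criticality = "required" | "informational";

interface EvaluatorResult {
  evaluator: string;
  passed: boolean;
  criticality: Criticality;
}

interface OverallVerdict {
  passed: boolean;
  blocking: string[]; // failed required evaluators
  warnings: string[]; // failed informational evaluators
}

// ASSUMPTION: only failures of required evaluators block the overall verdict.
function overallVerdict(results: EvaluatorResult[]): OverallVerdict {
  const failed = results.filter((r) => !r.passed);
  const blocking = failed.filter((r) => r.criticality === "required").map((r) => r.evaluator);
  const warnings = failed.filter((r) => r.criticality === "informational").map((r) => r.evaluator);
  return { passed: blocking.length === 0, blocking, warnings };
}

// A run where only the informational length check fails still passes overall.
const lenientRun = overallVerdict([
  { evaluator: "check_urls", passed: true, criticality: "required" },
  { evaluator: "check_tone", passed: true, criticality: "required" },
  { evaluator: "check_length", passed: false, criticality: "informational" },
]);
```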
- output-dev-eval-testing to implement each evaluator with verify() and wire them into evalWorkflow()
- output-eval-judge-prompt to write effective .prompt files
- output-eval-dataset-design to generate diverse test cases

Verdict.contains() works

- output-dev-eval-testing: Implement evaluators with verify(), Verdict, and evalWorkflow()
- output-eval-judge-prompt: Design LLM judge prompts for subjective failure modes
- output-eval-dataset-design: Generate diverse datasets when real traces are sparse
- output-eval-validate-judge: Validate LLM judges against human labels
- output-eval-audit: Audit an existing eval suite for trustworthiness
- output-workflow-trace: Retrieve and analyze workflow execution traces

npx claudepluginhub growthxai/output --plugin outputai