From outputai
Designs diverse eval datasets using dimension-based variation. Use when bootstrapping datasets, when real traces are sparse, or when existing datasets miss edge cases.
How this skill is triggered — by the user, by Claude, or both
Slash command
/outputai:output-eval-dataset-designThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Diverse datasets catch more failures. This skill teaches dimension-based dataset design — systematically varying inputs along axes that target failure-prone regions of your workflow. The output is a set of YAML dataset files ready for `output workflow test`.
Diverse datasets catch more failures. This skill teaches dimension-based dataset design — systematically varying inputs along axes that target failure-prone regions of your workflow. The output is a set of YAML dataset files ready for output workflow test.
output-eval-error-analysis) so your dimensions target real failure modes, not guessesIdentify 3+ axes of input variation that target anticipated failure modes. Each dimension should vary one aspect of the input that you expect to influence output quality.
Map failure categories to input properties that trigger them:
| Failure Category | Triggering Input Property | Dimension |
|---|---|---|
| Off-topic drift | Ambiguous or broad topics | Topic specificity: specific / broad / ambiguous |
| Tone mismatch | Conflicting tone signals | Tone difficulty: simple / nuanced / contradictory |
| Too short | Short or vague input | Input detail: minimal / moderate / comprehensive |
| Missing sections | Many explicit requirements | Requirement count: 0 / 1-2 / 5+ |
| Hallucinated URLs | Technical topics with real entities | Entity density: none / few / many |
| Dimension | Values | Why |
|---|---|---|
| Topic complexity | simple, technical, ambiguous | Technical and ambiguous topics trigger more hallucination and drift |
| Tone request | none, formal, casual, contradictory | Explicit tone requests reveal tone-matching failures |
| Length constraint | none, short (100w), long (1000w) | Extreme length constraints trigger truncation and padding |
| Required sections | none, 1 section, 3+ sections | Multiple required sections stress structural compliance |
Aim for 3-5 dimensions. More than 5 creates an unmanageable combinatorial space.
Create ~20 combinations of dimension values. Cover the extremes and the combinations most likely to cause failures.
| # | Topic Complexity | Tone | Length | Required Sections |
|---|---|---|---|---|
| 1 | simple | none | none | none |
| 2 | simple | formal | short | 1 section |
| 3 | technical | formal | long | 3+ sections |
| 4 | technical | casual | short | none |
| 5 | ambiguous | formal | long | 1 section |
| 6 | ambiguous | contradictory | none | 3+ sections |
| 7 | simple | contradictory | long | none |
| 8 | technical | none | none | 3+ sections |
| 9 | ambiguous | casual | short | 1 section |
| 10 | simple | casual | long | 3+ sections |
| 11 | technical | formal | short | 1 section |
| 12 | ambiguous | none | long | none |
| 13 | simple | formal | none | 3+ sections |
| 14 | technical | contradictory | long | 1 section |
| 15 | ambiguous | formal | short | 3+ sections |
| 16 | simple | none | short | 1 section |
| 17 | technical | casual | long | 3+ sections |
| 18 | ambiguous | contradictory | short | none |
| 19 | technical | formal | none | none |
| 20 | ambiguous | casual | long | 3+ sections |
Review each tuple and ask: "Is this a realistic scenario a user might create?" Discard any that aren't.
Transform each tuple into a JSON object matching the workflow's inputSchema from types.ts.
# Find the input schema
cat src/workflows/<workflow_name>/types.ts
For each tuple, write the corresponding JSON input:
Tuple 3: technical + formal + long + 3+ sections
{
"topic": "Quantum error correction techniques in superconducting qubit architectures",
"tone": "formal",
"min_length": 1000,
"required_sections": ["Introduction", "Technical Background", "Current Approaches", "Challenges", "Conclusion"]
}
Tuple 6: ambiguous + contradictory + none + 3+ sections
{
"topic": "Things that matter",
"tone": "Write in a formal academic style but keep it super casual and fun",
"required_sections": ["Overview", "Deep Dive", "Takeaways"]
}
Use realistic, natural-sounding inputs. Avoid test-looking data like "Test topic 1" or "Lorem ipsum."
For 20+ tuples, use an LLM to batch-convert tuples into realistic inputs. Create a prompt that takes the tuple values and the inputSchema, then generates a natural JSON input. Review every generated input for realism before using it.
Run each input through the workflow to capture real output:
# Generate datasets one at a time
npx output workflow dataset generate blog_generator \
--input '{"topic": "Quantum error correction", "tone": "formal", "min_length": 1000}' \
--name technical_formal_long
npx output workflow dataset generate blog_generator \
--input '{"topic": "Things that matter", "tone": "Write formally but keep it casual"}' \
--name ambiguous_contradictory
Each command creates a YAML file in tests/datasets/ with input and last_output populated.
Use names that encode the key dimensions:
tests/datasets/
├── simple_no_constraints.yml
├── simple_formal_short.yml
├── technical_formal_long_sections.yml
├── technical_casual_short.yml
├── ambiguous_formal_long.yml
├── ambiguous_contradictory_sections.yml
└── ...
For many datasets, create a shell script:
#!/bin/bash
# generate_datasets.sh
inputs=(
'{"topic": "Solar panels", "tone": "formal"}|simple_formal'
'{"topic": "Quantum error correction in superconducting qubits", "tone": "casual", "min_length": 1000}|technical_casual_long'
'{"topic": "Things that matter", "required_sections": ["Overview", "Dive", "Takeaways"]}|ambiguous_sections'
)
for entry in "${inputs[@]}"; do
IFS='|' read -r input name <<< "$entry"
echo "Generating: $name"
npx output workflow dataset generate blog_generator --input "$input" --name "$name"
done
After generation, edit each dataset YAML to add ground_truth labels. Review the last_output and assign verdicts per evaluator.
name: ambiguous_contradictory_sections
input:
topic: "Things that matter"
tone: "Write in a formal academic style but keep it super casual and fun"
required_sections: ["Overview", "Deep Dive", "Takeaways"]
last_output:
output:
title: "Things That Matter: A Casual Academic Exploration"
blog_post: "So here's the thing about stuff that matters..."
executionTimeMs: 4200
date: '2026-03-25T00:00:00.000Z'
ground_truth:
human_verdict: fail
notes: "Contradictory tone caused drift; missing 'Deep Dive' section"
evals:
check_tone:
expected_tone: "formal academic style"
verdict: fail
check_topic:
required_topic: "Things that matter"
verdict: partial
check_sections:
required_sections: ["Overview", "Deep Dive", "Takeaways"]
verdict: fail
check_length:
min_length: 100
verdict: pass
human_verdict for every dataset (pass/fail)notes explaining your reasoning for future referenceIf you have production traces available, use them to fill coverage gaps.
# Download production traces
npx output workflow dataset generate blog_generator --download --limit 30
After downloading, check which dimensions are underrepresented:
A good eval dataset combines both:
Review every dataset before using it in evals.
| Workflow Complexity | Minimum Datasets | Target |
|---|---|---|
| Simple (1-2 steps, narrow input) | 10 | 20 |
| Medium (3-5 steps, moderate variation) | 20 | 40 |
| Complex (5+ steps, wide input space) | 40 | 80+ |
More datasets improve eval reliability. Aim for at least 5 datasets per failure category to get meaningful failure rates.
# List all datasets
npx output workflow dataset list blog_generator
# Run evals on all datasets with cached output
npx output workflow test blog_generator --cached
# Run evals on a subset
npx output workflow test blog_generator --cached --dataset technical_formal_long,ambiguous_contradictory
Check that:
human_verdict in ground_truthoutput-eval-error-analysis — Identify failure modes that inform dimension selectionoutput-dev-eval-testing — Dataset YAML format, CLI commands, evalWorkflow() setupoutput-dev-scenario-file — Scenario JSON files as seed inputs for dataset generationoutput-eval-validate-judge — Split datasets into train/dev/test for judge validationoutput-eval-judge-prompt — Design LLM judge prompts that consume these datasetsnpx claudepluginhub growthxai/output --plugin outputaiGenerates synthetic test inputs for LLM pipeline evaluation using dimension-based tuples. Bootstrap eval datasets when real data is sparse or stress-test specific failure hypotheses.
Creates offline evaluation tests for Output SDK workflows using @outputai/evals. Use when writing test evaluators with verify(), creating dataset YAML files, building eval workflows, or running workflow tests via CLI.
Generates evaluation test cases for skills by analyzing skill config and metadata. Bootstraps datasets or expands existing ones for /eval-run.