Creates evaluator functions in evaluators.ts for Output SDK workflows. Handles quality assessment, validation logic, and content evaluation with confidence scores.
Slash command
`/outputai:output-dev-evaluator-function`
This skill documents how to create evaluator functions in `evaluators.ts` for Output SDK workflows. Evaluators are used to assess quality, validate outputs, and provide confidence-scored judgments about workflow results.
For smaller workflows, use a single evaluators.ts file:
src/workflows/{workflow-name}/
├── workflow.ts
├── steps.ts
├── evaluators.ts # All evaluators in one file
├── types.ts
└── ...
For larger workflows with many evaluators, use an evaluators/ folder:
src/workflows/{workflow-name}/
├── workflow.ts
├── steps.ts
├── evaluators/ # Evaluators split into individual files
│ ├── quality.ts
│ ├── accuracy.ts
│ └── completeness.ts
├── types.ts
└── ...
Important: evaluator() calls MUST be in files containing 'evaluators' in the path:
- `src/workflows/my_workflow/evaluators.ts` ✓
- `src/workflows/my_workflow/evaluators/quality.ts` ✓
- `src/shared/evaluators/common_evaluators.ts` ✓
- `src/workflows/my_workflow/helpers.ts` ✗ (cannot contain evaluator() calls)

Evaluators are Temporal activities with strict import rules to ensure deterministic replay.
Evaluators can import from:

- `./utils.js`, `./types.js`, `./helpers.js`
- `./lib/helpers.js`
- `../../shared/utils/*.js`
- `../../shared/clients/*.js`
- `../../shared/services/*.js`

Example of WRONG imports:
// WRONG - evaluators cannot import other evaluators
import { otherEvaluator } from '../../shared/evaluators/other.js'; // ✗
import { anotherEvaluator } from './other_evaluators.js'; // ✗
// CORRECT - Import from @outputai/core
import {
evaluator,
z,
EvaluationBooleanResult,
EvaluationNumberResult,
EvaluationStringResult,
EvaluationFeedback
} from '@outputai/core';
// WRONG - Never import z from zod
import { z } from 'zod';
// CORRECT - Use @outputai/llm wrapper
import { generateText, Output } from '@outputai/llm';
// WRONG - Never call LLM providers directly
import OpenAI from 'openai';
All relative imports MUST use the .js extension (package imports such as '@outputai/core' are written without it):
// CORRECT
import { BlogContent } from './types.js';
// WRONG - Missing .js extension
import { BlogContent } from './types';
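The two file-level rules above (the 'evaluators' path requirement and the .js extension on relative imports) can be sketched as a pair of predicates. This is only an illustration: `isEvaluatorPath` and `hasValidExtension` are hypothetical helper names, and the plain substring check is an assumption about how the path rule is applied, not a description of how the SDK enforces it.

```typescript
// Hypothetical helpers for illustration; they are not part of @outputai/core.

// Assumption: the path rule is satisfied when 'evaluators' appears anywhere in the path.
function isEvaluatorPath( filePath: string ): boolean {
  return filePath.indexOf( 'evaluators' ) !== -1;
}

// Relative specifiers must end in .js; bare package specifiers are left alone.
function hasValidExtension( specifier: string ): boolean {
  const isRelative = specifier.startsWith( './' ) || specifier.startsWith( '../' );
  return !isRelative || specifier.endsWith( '.js' );
}
```

Running these against the paths and imports listed above gives the same ✓/✗ verdicts, for example `isEvaluatorPath( 'src/workflows/my_workflow/helpers.ts' )` is false and `hasValidExtension( './types' )` is false.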
import { evaluator, z, EvaluationBooleanResult } from '@outputai/core';
export const myEvaluator = evaluator( {
name: 'my_evaluator',
description: 'Description of what this evaluator assesses',
inputSchema: z.object( { /* input schema */ } ),
fn: async input => {
// Evaluation logic
return new EvaluationBooleanResult( {
value: true,
confidence: 0.95
} );
}
} );
Unique identifier for the evaluator. Use snake_case.
name: 'evaluate_content_quality'
Human-readable description of what the evaluator assesses.
description: 'Evaluate the quality and completeness of generated content'
Schema for validating evaluator input.
inputSchema: z.object( {
content: z.string(),
expectedLength: z.number()
} )
The evaluator execution function. Returns an evaluation result with value and confidence.
fn: async input => {
const isValid = input.content.length >= input.expectedLength;
return new EvaluationBooleanResult( {
value: isValid,
confidence: 0.95
} );
}
Use for pass/fail or true/false evaluations:
import { EvaluationBooleanResult } from '@outputai/core';
return new EvaluationBooleanResult( {
value: true, // boolean result
confidence: 0.95, // 0.0 to 1.0
reasoning: 'Optional explanation of the evaluation'
} );
Use for numeric scores or ratings:
import { EvaluationNumberResult } from '@outputai/core';
return new EvaluationNumberResult( {
value: 85, // numeric result (e.g., 0-100 score)
confidence: 0.85, // 0.0 to 1.0
reasoning: 'Optional explanation of the score'
} );
Use for categorical or text-based evaluations:
import { EvaluationStringResult } from '@outputai/core';
return new EvaluationStringResult( {
value: 'positive', // string result (e.g., category, sentiment, label)
confidence: 0.9, // 0.0 to 1.0
reasoning: 'Optional explanation of the classification'
} );
| Property | Type | Required | Description |
|---|---|---|---|
| value | boolean, number, or string | Yes | The evaluation result |
| confidence | number (0.0-1.0) | Yes | Confidence in the evaluation |
| reasoning | string | No | Explanation of the evaluation |
| name | string | No | Name for this specific result (useful in dimensions) |
| feedback | EvaluationFeedback[] | No | Array of feedback objects with issues and suggestions |
| dimensions | EvaluationResult[] | No | Nested results for multi-dimensional evaluation |
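To make the table concrete, here is a minimal sketch of the result shape as a plain TypeScript interface, plus a check that `confidence` stays within 0.0-1.0 at every nesting level. `ResultShape`, `FeedbackShape`, and `findConfidenceProblems` are hypothetical names that only mirror the table; the real result classes come from @outputai/core and may differ.

```typescript
// Hypothetical shapes mirroring the properties table; not the SDK's own types.
interface FeedbackShape {
  issue: string;
  suggestion: string;
  priority: string;
}

interface ResultShape {
  value: boolean | number | string;
  confidence: number; // expected to be within 0.0-1.0
  reasoning?: string;
  name?: string;
  feedback?: FeedbackShape[];
  dimensions?: ResultShape[];
}

// Lists every result (top-level or nested dimension) whose confidence is out of range.
function findConfidenceProblems( result: ResultShape, path = 'result' ): string[] {
  const problems: string[] = [];
  if ( !( result.confidence >= 0 && result.confidence <= 1 ) ) {
    problems.push( `${path}: confidence ${result.confidence} is outside 0.0-1.0` );
  }
  ( result.dimensions || [] ).forEach( ( dim, i ) => {
    const label = dim.name !== undefined ? dim.name : String( i );
    problems.push( ...findConfidenceProblems( dim, `${path}.dimensions[${label}]` ) );
  } );
  return problems;
}
```

A check like this could run in a unit test before the values are handed to the result constructors.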
import { evaluator, z, EvaluationBooleanResult } from '@outputai/core';
export const evaluateCompleteness = evaluator( {
name: 'evaluate_completeness',
description: 'Check if content meets minimum length requirements',
inputSchema: z.object( {
content: z.string(),
minLength: z.number().default( 100 )
} ),
fn: async ( { content, minLength } ) => {
const isComplete = content.length >= minLength;
return new EvaluationBooleanResult( {
value: isComplete,
confidence: 1.0,
reasoning: isComplete ?
`Content has ${content.length} characters, meets minimum of ${minLength}` :
`Content has ${content.length} characters, below minimum of ${minLength}`
} );
}
} );
import { evaluator, z, EvaluationBooleanResult } from '@outputai/core';
export const evaluateGibberish = evaluator( {
name: 'evaluate_gibberish',
description: 'Check if a given string is gibberish',
inputSchema: z.string(),
fn: async content => {
const gibberishPatterns = [ 'foo', 'bar', 'lorem', 'ipsum' ];
const isGibberish = gibberishPatterns.some( p => content.toLowerCase().includes( p ) );
return new EvaluationBooleanResult( {
value: !isGibberish, // true means the content passed (no gibberish patterns found)
confidence: 0.95
} );
}
} );
import { evaluator, z, EvaluationNumberResult } from '@outputai/core';
export const evaluateReadability = evaluator( {
name: 'evaluate_readability',
description: 'Calculate readability score based on sentence structure',
inputSchema: z.object( {
content: z.string()
} ),
fn: async ( { content } ) => {
const sentences = content.split( /[.!?]+/ ).filter( s => s.trim() );
const words = content.split( /\s+/ ).filter( w => w.trim() );
const avgWordsPerSentence = words.length / Math.max( sentences.length, 1 );
// Simple readability score (lower avg words = more readable)
const score = Math.max( 0, Math.min( 100, 100 - ( avgWordsPerSentence - 15 ) * 5 ) );
return new EvaluationNumberResult( {
value: Math.round( score ),
confidence: 0.8,
reasoning: `Average ${avgWordsPerSentence.toFixed( 1 )} words per sentence`
} );
}
} );
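The scoring arithmetic in the readability evaluator above is easier to check when pulled out into a plain function. The sketch below reuses the same formula; `readabilityScore` and `sentenceOf` are hypothetical names introduced only for this illustration.

```typescript
// Same arithmetic as the readability evaluator, extracted for illustration.
function readabilityScore( content: string ): number {
  const sentences = content.split( /[.!?]+/ ).filter( s => s.trim() );
  const words = content.split( /\s+/ ).filter( w => w.trim() );
  const avgWordsPerSentence = words.length / Math.max( sentences.length, 1 );
  // 15 words per sentence or fewer scores 100; each extra word costs 5 points.
  return Math.round( Math.max( 0, Math.min( 100, 100 - ( avgWordsPerSentence - 15 ) * 5 ) ) );
}

// Builds a single sentence containing n words.
function sentenceOf( n: number ): string {
  return new Array<string>( n ).fill( 'word' ).join( ' ' ) + '.';
}

console.log( readabilityScore( sentenceOf( 15 ) ) ); // 100
console.log( readabilityScore( sentenceOf( 25 ) ) ); // 50
console.log( readabilityScore( sentenceOf( 40 ) ) ); // 0
console.log( readabilityScore( '' ) ); // 100
```

Note the last line: empty content has zero words, so the formula clamps to 100. Pairing this evaluator with a length check such as `evaluateCompleteness` avoids rewarding empty output.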
import { evaluator, z, EvaluationStringResult } from '@outputai/core';
export const evaluateSentiment = evaluator( {
name: 'evaluate_sentiment',
description: 'Classify the sentiment of content',
inputSchema: z.object( {
content: z.string()
} ),
fn: async ( { content } ) => {
const positiveWords = [ 'great', 'excellent', 'amazing', 'good', 'love' ];
const negativeWords = [ 'bad', 'terrible', 'awful', 'hate', 'poor' ];
const lowerContent = content.toLowerCase();
const positiveCount = positiveWords.filter( w => lowerContent.includes( w ) ).length;
const negativeCount = negativeWords.filter( w => lowerContent.includes( w ) ).length;
const { sentiment, confidence } = positiveCount > negativeCount ?
{ sentiment: 'positive', confidence: Math.min( 0.95, 0.6 + positiveCount * 0.1 ) } :
negativeCount > positiveCount ?
{ sentiment: 'negative', confidence: Math.min( 0.95, 0.6 + negativeCount * 0.1 ) } :
{ sentiment: 'neutral', confidence: 0.7 };
return new EvaluationStringResult( {
value: sentiment,
confidence,
reasoning: `Found ${positiveCount} positive and ${negativeCount} negative indicators`
} );
}
} );
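One caveat with the sentiment evaluator above: `includes()` matches substrings, so 'goodbye' counts as 'good' and 'badge' counts as 'bad'. The sketch below is a variant, not the original logic: it tokenizes the text first and counts only whole-word matches. `countIndicators` is a hypothetical helper name, and the simple `[a-z']+` tokenizer is an assumption that only suits English text.

```typescript
const POSITIVE_WORDS = [ 'great', 'excellent', 'amazing', 'good', 'love' ];
const NEGATIVE_WORDS = [ 'bad', 'terrible', 'awful', 'hate', 'poor' ];

// Counts how many listed words appear as whole tokens (each listed word counts once).
function countIndicators( content: string ): { positive: number; negative: number } {
  const tokens: string[] = content.toLowerCase().match( /[a-z']+/g ) || [];
  const present = ( word: string ): boolean => tokens.indexOf( word ) !== -1;
  return {
    positive: POSITIVE_WORDS.filter( present ).length,
    negative: NEGATIVE_WORDS.filter( present ).length
  };
}
```

The two counts can then feed the same positive/negative/neutral comparison used in the evaluator.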
Note: Evaluators are self-contained components that don't share schemas across steps, so defining Output.object() schemas inline is acceptable here. For workflow steps that share schemas, define them in types.ts instead.
import { evaluator, z, EvaluationNumberResult } from '@outputai/core';
import { generateText, Output } from '@outputai/llm';
export const evaluateSignalToNoise = evaluator( {
name: 'evaluate_signal_to_noise',
description: 'Evaluate the signal-to-noise ratio of content',
inputSchema: z.object( {
title: z.string(),
content: z.string()
} ),
fn: async ( { title, content } ) => {
const { output } = await generateText( {
prompt: 'signal_noise@v1', // References the signal_noise@v1 prompt file under prompts/
variables: {
title,
content
},
output: Output.object( {
schema: z.object( {
score: z.number().describe( 'Signal-to-noise score 0-100' )
} )
} )
} );
return new EvaluationNumberResult( {
value: output.score,
confidence: 0.85
} );
}
} );
import { evaluator, z, EvaluationBooleanResult } from '@outputai/core';
import { generateText, Output } from '@outputai/llm';
export const evaluateFactualAccuracy = evaluator( {
name: 'evaluate_factual_accuracy',
description: 'Check if content contains factual claims that can be verified',
inputSchema: z.object( {
content: z.string(),
topic: z.string()
} ),
fn: async ( { content, topic } ) => {
const { output } = await generateText( {
prompt: 'factual_check@v1',
variables: { content, topic },
output: Output.object( {
schema: z.object( {
isFactual: z.boolean().describe( 'Whether content appears factually accurate' ),
confidence: z.number().describe( 'Confidence in assessment 0-1' ),
issues: z.array( z.string() ).optional().describe( 'Any factual issues found' )
} )
} )
} );
return new EvaluationBooleanResult( {
value: output.isFactual,
confidence: output.confidence,
reasoning: output.issues?.length ?
`Issues found: ${output.issues.join( ', ' )}` :
'No factual issues detected'
} );
}
} );
import { evaluator, z, EvaluationStringResult } from '@outputai/core';
import { generateText, Output } from '@outputai/llm';
export const evaluateContentCategory = evaluator( {
name: 'evaluate_content_category',
description: 'Classify content into a category',
inputSchema: z.object( {
content: z.string(),
categories: z.array( z.string() )
} ),
fn: async ( { content, categories } ) => {
const { output } = await generateText( {
prompt: 'categorize_content@v1',
variables: {
content,
categories: categories.join( ', ' )
},
output: Output.object( {
schema: z.object( {
category: z.string().describe( 'The best matching category' ),
confidence: z.number().describe( 'Confidence in classification 0-1' ),
explanation: z.string().describe( 'Why this category was chosen' )
} )
} )
} );
return new EvaluationStringResult( {
value: output.category,
confidence: output.confidence,
reasoning: output.explanation
} );
}
} );
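The LLM-powered examples above pass the model-reported `confidence` straight into the result. Because the schemas describe the 0-1 range with `.describe()` only, nothing in the schema rejects an out-of-range number. A minimal defensive sketch, assuming you want to clamp rather than fail, could look like this; `clampConfidence` is a hypothetical helper, not part of the SDK.

```typescript
// Hypothetical helper: keeps a model-reported confidence inside 0.0-1.0.
function clampConfidence( raw: number, fallback = 0 ): number {
  if ( !Number.isFinite( raw ) ) {
    return fallback; // NaN or Infinity: use the fallback instead of propagating it
  }
  return Math.min( 1, Math.max( 0, raw ) );
}

console.log( clampConfidence( 0.85 ) ); // 0.85
console.log( clampConfidence( 1.3 ) ); // 1
console.log( clampConfidence( -0.2 ) ); // 0
```

In an evaluator this would replace `confidence: output.confidence` with `confidence: clampConfidence( output.confidence )`.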
Use the feedback field to provide actionable improvement suggestions alongside your evaluation result. Import EvaluationFeedback from @outputai/core to create feedback objects.
import { evaluator, z, EvaluationStringResult, EvaluationFeedback } from '@outputai/core';
export const evaluateWithFeedback = evaluator( {
name: 'evaluate_with_feedback',
description: 'Evaluate content quality and provide actionable feedback',
inputSchema: z.string(),
fn: async response => {
const feedback = [];
if ( response.length < 50 ) {
feedback.push( new EvaluationFeedback( {
issue: 'Response is too short',
suggestion: 'Expand the response with more detail',
priority: 'medium'
} ) );
}
return new EvaluationStringResult( {
value: feedback.length === 0 ? 'good' : 'needs_improvement',
confidence: 0.85,
feedback: feedback
} );
}
} );
| Property | Type | Description |
|---|---|---|
| issue | string | The problem identified |
| suggestion | string | Recommended fix |
| priority | string | Priority level (e.g., 'low', 'medium', 'high') |
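When an evaluator collects several feedback items, it can help to order them so the most urgent come first. The sketch below assumes `priority` is limited to the three example values from the table ('low', 'medium', 'high'); the table itself only types it as a string, and `FeedbackItem`, `PRIORITY_RANK`, and `sortByPriority` are hypothetical names.

```typescript
// Assumption: priority is one of the three example values from the table.
type Priority = 'low' | 'medium' | 'high';

interface FeedbackItem {
  issue: string;
  suggestion: string;
  priority: Priority;
}

const PRIORITY_RANK: Record<Priority, number> = { high: 0, medium: 1, low: 2 };

// Returns a sorted copy (high first); the input array is not modified.
function sortByPriority( items: FeedbackItem[] ): FeedbackItem[] {
  return items.slice().sort( ( a, b ) => PRIORITY_RANK[ a.priority ] - PRIORITY_RANK[ b.priority ] );
}
```

The sorted plain objects could then be mapped to `new EvaluationFeedback( ... )` instances before being placed in the `feedback` field.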
Use the dimensions field to nest EvaluationResult instances for sub-scores. Each dimension should use the name field to identify it.
import { evaluator, z, EvaluationStringResult, EvaluationNumberResult } from '@outputai/core';
export const evaluateMultiDimensional = evaluator( {
name: 'evaluate_multi_dimensional',
description: 'Evaluate content across multiple quality dimensions',
inputSchema: z.string(),
fn: async response => {
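// calculateCoherence and calculateRelevance are not defined in this example;
// they are assumed to be helpers that return scores between 0 and 1.
const coherenceScore = calculateCoherence( response );
const relevanceScore = calculateRelevance( response );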
const overallScore = ( coherenceScore + relevanceScore ) / 2;
return new EvaluationStringResult( {
value: overallScore > 0.7 ? 'high_quality' : 'low_quality',
confidence: 0.9,
dimensions: [
new EvaluationNumberResult( {
value: coherenceScore,
confidence: 0.85,
name: 'coherence'
} ),
new EvaluationNumberResult( {
value: relevanceScore,
confidence: 0.88,
name: 'relevance'
} )
]
} );
}
} );
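The roll-up step in the example above (average the sub-scores, then compare against 0.7) can be sketched as a standalone function. Generalizing from two dimensions to any number is an assumption made here for illustration, and `DimensionScore` and `overallQuality` are hypothetical names.

```typescript
interface DimensionScore {
  name: string;
  value: number; // assumed to be in the 0-1 range, as in the example above
}

// Averages the dimension values and labels the result against a threshold.
function overallQuality(
  dimensions: DimensionScore[],
  threshold = 0.7
): { overall: number; label: 'high_quality' | 'low_quality' } {
  const total = dimensions.reduce( ( sum, d ) => sum + d.value, 0 );
  const overall = dimensions.length > 0 ? total / dimensions.length : 0;
  // Strictly greater than the threshold, matching `overallScore > 0.7` in the example.
  return { overall, label: overall > threshold ? 'high_quality' : 'low_quality' };
}
```

Because the comparison is strict, an average of exactly 0.7 is labeled 'low_quality'; an empty list is treated as a score of 0.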
Based on a real workflow evaluator file:
import { evaluator, z, EvaluationBooleanResult, EvaluationNumberResult } from '@outputai/core';
import { generateText, Output } from '@outputai/llm';
import { blogContentSchema } from './types.js';
import type { BlogContent, QualityMetrics } from './types.js';
// Simple boolean evaluator
export const evaluateMinimumLength = evaluator( {
name: 'evaluate_minimum_length',
description: 'Check if blog content meets minimum length requirements',
inputSchema: blogContentSchema,
fn: async ( input: BlogContent ) => {
const MIN_TOKENS = 500;
const meetsRequirement = input.tokenCount >= MIN_TOKENS;
return new EvaluationBooleanResult( {
value: meetsRequirement,
confidence: 1.0,
reasoning: `Content has ${input.tokenCount} tokens (minimum: ${MIN_TOKENS})`
} );
}
} );
// LLM-powered number evaluator
export const evaluateSignalToNoise = evaluator( {
name: 'evaluate_signal_to_noise',
description: 'Evaluate the signal-to-noise ratio of blog content',
inputSchema: blogContentSchema,
fn: async ( input: BlogContent ) => {
const { output } = await generateText( {
prompt: 'signal_noise@v1',
variables: {
title: input.title,
content: input.content
},
output: Output.object( {
schema: z.object( {
score: z.number().describe( 'Signal-to-noise score 0-100' )
} )
} )
} );
return new EvaluationNumberResult( {
value: output.score,
confidence: 0.85
} );
}
} );
// LLM-powered boolean evaluator
export const evaluateRelevance = evaluator( {
name: 'evaluate_relevance',
description: 'Check if content is relevant to the stated topic',
inputSchema: z.object( {
content: z.string(),
topic: z.string(),
keywords: z.array( z.string() )
} ),
fn: async ( { content, topic, keywords } ) => {
const { output } = await generateText( {
prompt: 'relevance_check@v1',
variables: { content, topic, keywords: keywords.join( ', ' ) },
output: Output.object( {
schema: z.object( {
isRelevant: z.boolean(),
relevanceScore: z.number().describe( 'Relevance score 0-1' ),
explanation: z.string()
} )
} )
} );
return new EvaluationBooleanResult( {
value: output.isRelevant,
confidence: output.relevanceScore,
reasoning: output.explanation
} );
}
} );
// Boolean for pass/fail decisions
return new EvaluationBooleanResult( { value: true, confidence: 0.9 } );
// Number for scores and ratings
return new EvaluationNumberResult( { value: 85, confidence: 0.85 } );
// String for categories, labels, or classifications
return new EvaluationStringResult( { value: 'positive', confidence: 0.9 } );
// High confidence for deterministic checks
confidence: 1.0 // e.g., length checks, pattern matching
// Medium confidence for heuristic-based evaluations
confidence: 0.85 // e.g., LLM-based assessments
// Lower confidence for uncertain evaluations
confidence: 0.7 // e.g., subjective quality judgments
return new EvaluationBooleanResult( {
value: false,
confidence: 0.95,
reasoning: `Content contains ${errorCount} grammatical errors, exceeding threshold of ${maxErrors}`
} );
// Good - single responsibility
export const evaluateGrammar = evaluator( { ... } );
export const evaluateReadability = evaluator( { ... } );
export const evaluateTone = evaluator( { ... } );
// Avoid - doing too much in one evaluator
export const evaluateEverything = evaluator( { ... } );
// Good - clear what is being evaluated
name: 'evaluate_content_originality'
name: 'evaluate_factual_accuracy'
name: 'evaluate_sentiment_alignment'
// Avoid - vague names
name: 'check'
name: 'validate'
name: 'evaluate_stuff'
feedback: [
new EvaluationFeedback( {
issue: 'Missing conclusion paragraph',
suggestion: 'Add a summary paragraph at the end',
priority: 'high'
} )
]
dimensions: [
new EvaluationNumberResult( { value: 8, confidence: 0.9, name: 'coherence' } ),
new EvaluationNumberResult( { value: 6, confidence: 0.85, name: 'relevance' } )
]
Checklist:

- `evaluator`, `z`, and result types imported from `@outputai/core`
- `generateText` and `Output` imported from `@outputai/llm` if using an LLM (not a direct provider)
- LLM output schemas use `.describe()` instead of `.min()`/`.max()` on `z.number()`
- Imports use the `.js` extension
- Evaluator defines `name`, `description`, `inputSchema`, and `fn`
- `name` uses `snake_case`
- Returns a result type (`EvaluationBooleanResult`, `EvaluationNumberResult`, or `EvaluationStringResult`)
- `EvaluationFeedback` imported from `@outputai/core` when using feedback
- Feedback objects include `issue`, `suggestion`, and `priority`
- Dimensions use the `name` field to identify sub-evaluations

Related skills:

- `output-dev-workflow-function` - Orchestrating evaluators in workflow.ts
- `output-dev-step-function` - Creating step functions
- `output-dev-types-file` - Defining evaluator input schemas
- `output-dev-prompt-file` - Creating prompt files for LLM-powered evaluators
- `output-dev-folder-structure` - Understanding project layout
- `output-eval-error-analysis` - Identify what to evaluate before writing evaluators