Skill

output-dev-evaluator-function

Creates evaluator functions in evaluators.ts for Output SDK workflows. Handles quality assessment, validation logic, and content evaluation with confidence scores.

TypeScript

testing

backend

Popularity

Parent stars

422

Parent forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/outputai:output-dev-evaluator-function

Tool Access

This skill is limited to the following tools:

Read, Write, Edit

SKILL.md

730 lines · ~5.4k tokens (exceeds 5k compaction limit)

Stats

Language: JavaScript

Parent stars: 422

Parent forks: 12

Maintenance: Excellent

Last Commit: Jun 23, 2026

Creating Evaluator Functions

Overview

This skill documents how to create evaluator functions in evaluators.ts for Output SDK workflows. Evaluators are used to assess quality, validate outputs, and provide confidence-scored judgments about workflow results.

When to Use This Skill

Implementing quality assessment for workflow outputs
Adding validation logic with confidence scores
Creating LLM-powered content evaluation
Building reusable evaluation components

File Organization

Option 1: Flat File (Default)

For smaller workflows, use a single evaluators.ts file:

src/workflows/{workflow-name}/
├── workflow.ts
├── steps.ts
├── evaluators.ts    # All evaluators in one file
├── types.ts
└── ...

Option 2: Folder-Based (Large workflows)

For larger workflows with many evaluators, use an evaluators/ folder:

src/workflows/{workflow-name}/
├── workflow.ts
├── steps.ts
├── evaluators/      # Evaluators split into individual files
│   ├── quality.ts
│   ├── accuracy.ts
│   └── completeness.ts
├── types.ts
└── ...

Component Location Rules

Important: evaluator() calls MUST be in files containing 'evaluators' in the path:

src/workflows/my_workflow/evaluators.ts ✓
src/workflows/my_workflow/evaluators/quality.ts ✓
src/shared/evaluators/common_evaluators.ts ✓
src/workflows/my_workflow/helpers.ts ✗ (cannot contain evaluator() calls)
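
The location rule above can be sketched as a plain substring check. This is only an illustration: `isEvaluatorPath` is a hypothetical helper, not part of the Output SDK, and the SDK's real check may be stricter than a substring match.

```typescript
// Hypothetical helper (not an SDK function): mirrors the documented rule that
// evaluator() calls must live in a file whose path contains "evaluators".
function isEvaluatorPath( filePath: string ): boolean {
  return filePath.includes( 'evaluators' );
}

const candidatePaths = [
  'src/workflows/my_workflow/evaluators.ts',
  'src/workflows/my_workflow/evaluators/quality.ts',
  'src/shared/evaluators/common_evaluators.ts',
  'src/workflows/my_workflow/helpers.ts'
];

// Paths that may not contain evaluator() calls under this rule.
const rejectedPaths = candidatePaths.filter( p => !isEvaluatorPath( p ) );
// rejectedPaths: [ 'src/workflows/my_workflow/helpers.ts' ]
```

A check like this could run in a pre-commit script, though a substring test would also accept a path such as `src/evaluators_old/helpers.ts`.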

Activity Isolation Constraints

Evaluators are Temporal activities with strict import rules to ensure deterministic replay.

Evaluators CAN import from:

Local workflow files: ./utils.js, ./types.js, ./helpers.js
Local subdirectories: ./lib/helpers.js
Shared utilities: ../../shared/utils/*.js
Shared clients: ../../shared/clients/*.js
Shared services: ../../shared/services/*.js

Evaluators CANNOT import:

Other evaluator files (activity isolation)
Step files
Workflow files

Example of WRONG imports:

// WRONG - evaluators cannot import other evaluators
import { otherEvaluator } from '../../shared/evaluators/other.js'; // ✗
import { anotherEvaluator } from './other_evaluators.js'; // ✗

Critical Import Patterns

Core Imports

// CORRECT - Import from @outputai/core
import {
  evaluator,
  z,
  EvaluationBooleanResult,
  EvaluationNumberResult,
  EvaluationStringResult,
  EvaluationFeedback
} from '@outputai/core';

// WRONG - Never import z from zod
import { z } from 'zod';

LLM Client Import (for LLM-powered evaluators)

// CORRECT - Use @outputai/llm wrapper
import { generateText, Output } from '@outputai/llm';

// WRONG - Never call LLM providers directly
import OpenAI from 'openai';

ES Module Imports

All relative imports MUST use the .js extension (package imports such as @outputai/core are unaffected):

// CORRECT
import { BlogContent } from './types.js';

// WRONG - Missing .js extension
import { BlogContent } from './types';
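
As a rough illustration of the extension rule, the sketch below flags relative specifiers that lack `.js`. `usesRequiredJsExtension` is a hypothetical lint-style helper introduced here, not something the Output SDK provides.

```typescript
// Hypothetical helper: relative specifiers must end in ".js"; package
// specifiers such as "@outputai/core" are left alone.
function usesRequiredJsExtension( specifier: string ): boolean {
  const isRelative = specifier.startsWith( './' ) || specifier.startsWith( '../' );
  return !isRelative || specifier.endsWith( '.js' );
}

const withExtension = usesRequiredJsExtension( './types.js' );        // true
const missingExtension = usesRequiredJsExtension( './types' );        // false
const packageSpecifier = usesRequiredJsExtension( '@outputai/core' ); // true
```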

Basic Structure

import { evaluator, z, EvaluationBooleanResult } from '@outputai/core';

export const myEvaluator = evaluator( {
  name: 'my_evaluator',
  description: 'Description of what this evaluator assesses',
  inputSchema: z.object( { /* input schema */ } ),
  fn: async input => {
    // Evaluation logic
    return new EvaluationBooleanResult( {
      value: true,
      confidence: 0.95
    } );
  }
} );

Required Properties

name (string)

Unique identifier for the evaluator. Use snake_case.

name: 'evaluate_content_quality'

description (string)

Human-readable description of what the evaluator assesses.

description: 'Evaluate the quality and completeness of generated content'

inputSchema (Zod schema)

Schema for validating evaluator input.

inputSchema: z.object( {
  content: z.string(),
  expectedLength: z.number()
} )

fn (async function)

The evaluator execution function. Returns an evaluation result with value and confidence.

fn: async input => {
  const isValid = input.content.length >= input.expectedLength;
  return new EvaluationBooleanResult( {
    value: isValid,
    confidence: 0.95
  } );
}

Result Types

EvaluationBooleanResult

Use for pass/fail or true/false evaluations:

import { EvaluationBooleanResult } from '@outputai/core';

return new EvaluationBooleanResult( {
  value: true,           // boolean result
  confidence: 0.95,      // 0.0 to 1.0
  reasoning: 'Optional explanation of the evaluation'
} );

EvaluationNumberResult

Use for numeric scores or ratings:

import { EvaluationNumberResult } from '@outputai/core';

return new EvaluationNumberResult( {
  value: 85,             // numeric result (e.g., 0-100 score)
  confidence: 0.85,      // 0.0 to 1.0
  reasoning: 'Optional explanation of the score'
} );

EvaluationStringResult

Use for categorical or text-based evaluations:

import { EvaluationStringResult } from '@outputai/core';

return new EvaluationStringResult( {
  value: 'positive',     // string result (e.g., category, sentiment, label)
  confidence: 0.9,       // 0.0 to 1.0
  reasoning: 'Optional explanation of the classification'
} );

Result Properties

Property	Type	Required	Description
`value`	`boolean`, `number`, or `string`	Yes	The evaluation result
`confidence`	`number` (0.0-1.0)	Yes	Confidence in the evaluation
`reasoning`	`string`	No	Explanation of the evaluation
`name`	`string`	No	Name for this specific result (useful in dimensions)
`feedback`	`EvaluationFeedback[]`	No	Array of feedback objects with issues and suggestions
`dimensions`	`EvaluationResult[]`	No	Nested results for multi-dimensional evaluation
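
The table above can be read as the following shape. The interfaces are local stand-ins written for illustration, not the SDK's class definitions, and `flattenResult` is a hypothetical helper showing how nested `dimensions` relate to their parent result.

```typescript
// Local stand-in types for illustration only. The real result classes come
// from @outputai/core and may carry extra fields or behavior.
interface FeedbackSketch {
  issue: string;
  suggestion: string;
  priority: string;
}

interface ResultSketch {
  value: boolean | number | string; // required
  confidence: number;               // required, 0.0 to 1.0
  reasoning?: string;
  name?: string;
  feedback?: FeedbackSketch[];
  dimensions?: ResultSketch[];
}

interface FlatEntry {
  path: string;
  value: boolean | number | string;
  confidence: number;
}

// Hypothetical helper: walks a result and its nested dimensions, returning
// one entry per result so sub-scores are easy to log or tabulate.
function flattenResult( result: ResultSketch, path = 'root' ): FlatEntry[] {
  const entries: FlatEntry[] = [ { path, value: result.value, confidence: result.confidence } ];
  ( result.dimensions ?? [] ).forEach( ( dim, index ) => {
    const label = dim.name ?? String( index );
    entries.push( ...flattenResult( dim, `${path}.${label}` ) );
  } );
  return entries;
}

const sampleResult: ResultSketch = {
  value: 'high_quality',
  confidence: 0.9,
  dimensions: [
    { value: 0.8, confidence: 0.85, name: 'coherence' },
    { value: 0.9, confidence: 0.88 } // no name: falls back to its index
  ]
};

const flatPaths = flattenResult( sampleResult ).map( e => e.path );
// flatPaths: [ 'root', 'root.coherence', 'root.1' ]
```

The unnamed dimension ending up as `root.1` is one reason to set `name` on every dimension.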

Simple Evaluator Examples

Boolean Evaluator - Content Validation

import { evaluator, z, EvaluationBooleanResult } from '@outputai/core';

export const evaluateCompleteness = evaluator( {
  name: 'evaluate_completeness',
  description: 'Check if content meets minimum length requirements',
  inputSchema: z.object( {
    content: z.string(),
    minLength: z.number().default( 100 )
  } ),
  fn: async ( { content, minLength } ) => {
    const isComplete = content.length >= minLength;

    return new EvaluationBooleanResult( {
      value: isComplete,
      confidence: 1.0,
      reasoning: isComplete ?
        `Content has ${content.length} characters, meets minimum of ${minLength}` :
        `Content has ${content.length} characters, below minimum of ${minLength}`
    } );
  }
} );

Boolean Evaluator - Pattern Detection

import { evaluator, z, EvaluationBooleanResult } from '@outputai/core';

export const evaluateGibberish = evaluator( {
  name: 'evaluate_gibberish',
  description: 'Check if a given string is gibberish',
  inputSchema: z.string(),
  fn: async content => {
    const gibberishPatterns = [ 'foo', 'bar', 'lorem', 'ipsum' ];
    const isGibberish = gibberishPatterns.some( p => content.toLowerCase().includes( p ) );

    return new EvaluationBooleanResult( {
      value: !isGibberish, // true means no gibberish pattern was found
      confidence: 0.95
    } );
  }
} );

Number Evaluator - Quality Score

import { evaluator, z, EvaluationNumberResult } from '@outputai/core';

export const evaluateReadability = evaluator( {
  name: 'evaluate_readability',
  description: 'Calculate readability score based on sentence structure',
  inputSchema: z.object( {
    content: z.string()
  } ),
  fn: async ( { content } ) => {
    const sentences = content.split( /[.!?]+/ ).filter( s => s.trim() );
    const words = content.split( /\s+/ ).filter( w => w.trim() );
    const avgWordsPerSentence = words.length / Math.max( sentences.length, 1 );

    // Simple readability score (lower avg words = more readable)
    const score = Math.max( 0, Math.min( 100, 100 - ( avgWordsPerSentence - 15 ) * 5 ) );

    return new EvaluationNumberResult( {
      value: Math.round( score ),
      confidence: 0.8,
      reasoning: `Average ${avgWordsPerSentence.toFixed( 1 )} words per sentence`
    } );
  }
} );
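
To see how the scoring formula behaves, the sketch below lifts the same arithmetic into a standalone function (no SDK imports) and runs it on two inputs. `readabilityScore` is a name introduced here for illustration.

```typescript
// Same arithmetic as the evaluator above, extracted so it can run on its own.
function readabilityScore( content: string ): { score: number; avgWordsPerSentence: number } {
  const sentences = content.split( /[.!?]+/ ).filter( s => s.trim() );
  const words = content.split( /\s+/ ).filter( w => w.trim() );
  const avgWordsPerSentence = words.length / Math.max( sentences.length, 1 );
  const score = Math.max( 0, Math.min( 100, 100 - ( avgWordsPerSentence - 15 ) * 5 ) );
  return { score: Math.round( score ), avgWordsPerSentence };
}

// Two sentences, four words: average 2, so 100 - ( 2 - 15 ) * 5 = 165, clamped to 100.
const shortText = readabilityScore( 'Short one. Also short.' );

// One sentence of 25 words: 100 - ( 25 - 15 ) * 5 = 50.
const longSentence = readabilityScore( Array.from( { length: 25 }, () => 'word' ).join( ' ' ) + '.' );
```

The score saturates at 100 for averages of 15 words or fewer and reaches 0 at 35 words per sentence. An empty string also scores 100, so a length check may be worth running first.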

String Evaluator - Sentiment Classification

import { evaluator, z, EvaluationStringResult } from '@outputai/core';

export const evaluateSentiment = evaluator( {
  name: 'evaluate_sentiment',
  description: 'Classify the sentiment of content',
  inputSchema: z.object( {
    content: z.string()
  } ),
  fn: async ( { content } ) => {
    const positiveWords = [ 'great', 'excellent', 'amazing', 'good', 'love' ];
    const negativeWords = [ 'bad', 'terrible', 'awful', 'hate', 'poor' ];

    const lowerContent = content.toLowerCase();
    const positiveCount = positiveWords.filter( w => lowerContent.includes( w ) ).length;
    const negativeCount = negativeWords.filter( w => lowerContent.includes( w ) ).length;

    const { sentiment, confidence } = positiveCount > negativeCount ?
      { sentiment: 'positive', confidence: Math.min( 0.95, 0.6 + positiveCount * 0.1 ) } :
      negativeCount > positiveCount ?
        { sentiment: 'negative', confidence: Math.min( 0.95, 0.6 + negativeCount * 0.1 ) } :
        { sentiment: 'neutral', confidence: 0.7 };

    return new EvaluationStringResult( {
      value: sentiment,
      confidence,
      reasoning: `Found ${positiveCount} positive and ${negativeCount} negative indicators`
    } );
  }
} );
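
One caveat with the keyword approach above: `includes()` matches substrings, so "whatever" counts as "hate" and "badge" counts as "bad". The sketch below compares that behavior with whole-word matching. This is an alternative technique offered for comparison, not the original example's method, and both function names are introduced here.

```typescript
const NEGATIVE_WORDS = [ 'bad', 'terrible', 'awful', 'hate', 'poor' ];

// Substring matching, as in the sentiment example above.
function countBySubstring( content: string, words: string[] ): number {
  const lower = content.toLowerCase();
  return words.filter( w => lower.includes( w ) ).length;
}

// Whole-word matching: split on anything that is not a-z, then compare tokens exactly.
function countByWholeWord( content: string, words: string[] ): number {
  const tokens = new Set( content.toLowerCase().split( /[^a-z]+/ ).filter( t => t.length > 0 ) );
  return words.filter( w => tokens.has( w ) ).length;
}

const sampleText = 'Whatever you pick, the badge looks fine.';
const substringHits = countBySubstring( sampleText, NEGATIVE_WORDS ); // 2 ("hate" in "whatever", "bad" in "badge")
const wholeWordHits = countByWholeWord( sampleText, NEGATIVE_WORDS ); // 0
```

The tokenizer here only handles ASCII letters, so accented or non-Latin text would need a different split.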

LLM-Powered Evaluator Examples

Note: Evaluators are self-contained components that don't share schemas across steps, so defining Output.object() schemas inline is acceptable here. For workflow steps that share schemas, define them in types.ts instead.

Using generateText with Output.object() for Evaluation

import { evaluator, z, EvaluationNumberResult } from '@outputai/core';
import { generateText, Output } from '@outputai/llm';

export const evaluateSignalToNoise = evaluator( {
  name: 'evaluate_signal_to_noise',
  description: 'Evaluate the signal-to-noise ratio of content',
  inputSchema: z.object( {
    title: z.string(),
    content: z.string()
  } ),
  fn: async ( { title, content } ) => {
    const { output } = await generateText( {
      prompt: 'signal_noise@v1',  // References the signal_noise@v1 prompt file under prompts/
      variables: {
        title,
        content
      },
      output: Output.object( {
        schema: z.object( {
          score: z.number().describe( 'Signal-to-noise score 0-100' )
        } )
      } )
    } );

    return new EvaluationNumberResult( {
      value: output.score,
      confidence: 0.85
    } );
  }
} );

LLM Boolean Evaluation

import { evaluator, z, EvaluationBooleanResult } from '@outputai/core';
import { generateText, Output } from '@outputai/llm';

export const evaluateFactualAccuracy = evaluator( {
  name: 'evaluate_factual_accuracy',
  description: 'Check if content contains factual claims that can be verified',
  inputSchema: z.object( {
    content: z.string(),
    topic: z.string()
  } ),
  fn: async ( { content, topic } ) => {
    const { output } = await generateText( {
      prompt: 'factual_check@v1',
      variables: { content, topic },
      output: Output.object( {
        schema: z.object( {
          isFactual: z.boolean().describe( 'Whether content appears factually accurate' ),
          confidence: z.number().describe( 'Confidence in assessment 0-1' ),
          issues: z.array( z.string() ).optional().describe( 'Any factual issues found' )
        } )
      } )
    } );

    return new EvaluationBooleanResult( {
      value: output.isFactual,
      confidence: output.confidence,
      reasoning: output.issues?.length ?
        `Issues found: ${output.issues.join( ', ' )}` :
        'No factual issues detected'
    } );
  }
} );
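
The schema above declares `confidence` as a bare `z.number()`, so nothing stops a model from returning a value such as 1.4 or -0.2. A minimal sketch of a guard, assuming you want out-of-range values clamped rather than rejected (`clampConfidence` is a hypothetical helper, not an SDK function):

```typescript
// Hypothetical helper: coerce a model-reported confidence into the 0.0-1.0
// range, falling back to a default when the value is not a finite number.
function clampConfidence( raw: number, fallback = 0.5 ): number {
  if ( !Number.isFinite( raw ) ) {
    return fallback;
  }
  return Math.min( 1, Math.max( 0, raw ) );
}

const inRange = clampConfidence( 0.85 );   // 0.85
const tooHigh = clampConfidence( 1.4 );    // 1
const tooLow = clampConfidence( -0.2 );    // 0
const notANumber = clampConfidence( NaN ); // 0.5 (the fallback)
```

Inside the evaluator this would be used as `confidence: clampConfidence( output.confidence )`. Zod's own `.min( 0 ).max( 1 )` could express the same bound in the schema, assuming the re-exported `z` and the structured-output mode in use accept it.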

LLM String Evaluation - Content Classification

import { evaluator, z, EvaluationStringResult } from '@outputai/core';
import { generateText, Output } from '@outputai/llm';

export const evaluateContentCategory = evaluator( {
  name: 'evaluate_content_category',
  description: 'Classify content into a category',
  inputSchema: z.object( {
    content: z.string(),
    categories: z.array( z.string() )
  } ),
  fn: async ( { content, categories } ) => {
    const { output } = await generateText( {
      prompt: 'categorize_content@v1',
      variables: {
        content,
        categories: categories.join( ', ' )
      },
      output: Output.object( {
        schema: z.object( {
          category: z.string().describe( 'The best matching category' ),
          confidence: z.number().describe( 'Confidence in classification 0-1' ),
          explanation: z.string().describe( 'Why this category was chosen' )
        } )
      } )
    } );

    return new EvaluationStringResult( {
      value: output.category,
      confidence: output.confidence,
      reasoning: output.explanation
    } );
  }
} );

EvaluationResult with Feedback

Use the feedback field to provide actionable improvement suggestions alongside your evaluation result. Import EvaluationFeedback from @outputai/core to create feedback objects.

import { evaluator, z, EvaluationStringResult, EvaluationFeedback } from '@outputai/core';

export const evaluateWithFeedback = evaluator( {
  name: 'evaluate_with_feedback',
  description: 'Evaluate content quality and provide actionable feedback',
  inputSchema: z.string(),
  fn: async response => {
    const feedback: EvaluationFeedback[] = [];

    if ( response.length < 50 ) {
      feedback.push( new EvaluationFeedback( {
        issue: 'Response is too short',
        suggestion: 'Expand the response with more detail',
        priority: 'medium'
      } ) );
    }

    return new EvaluationStringResult( {
      value: feedback.length === 0 ? 'good' : 'needs_improvement',
      confidence: 0.85,
      feedback: feedback
    } );
  }
} );

EvaluationFeedback Properties

Property	Type	Description
`issue`	`string`	The problem identified
`suggestion`	`string`	Recommended fix
`priority`	`string`	Priority level (e.g., `'low'`, `'medium'`, `'high'`)
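
When an evaluator returns several feedback items, a consumer may want the most urgent first. The sketch below orders plain objects by the three example priority labels from the table. `FeedbackItem` is a local stand-in for `EvaluationFeedback`, and `sortByPriority` is a hypothetical helper.

```typescript
// Plain-object stand-in for EvaluationFeedback; the real class comes from
// @outputai/core.
interface FeedbackItem {
  issue: string;
  suggestion: string;
  priority: string;
}

const PRIORITY_RANK = new Map<string, number>( [
  [ 'high', 0 ],
  [ 'medium', 1 ],
  [ 'low', 2 ]
] );

// Returns a new array with the most urgent items first. Labels not listed in
// PRIORITY_RANK sort last.
function sortByPriority( items: FeedbackItem[] ): FeedbackItem[] {
  const rank = ( p: string ): number => PRIORITY_RANK.get( p ) ?? Number.MAX_SAFE_INTEGER;
  return [ ...items ].sort( ( a, b ) => rank( a.priority ) - rank( b.priority ) );
}

const ordered = sortByPriority( [
  { issue: 'Minor typo', suggestion: 'Fix spelling', priority: 'low' },
  { issue: 'Missing conclusion paragraph', suggestion: 'Add a summary paragraph at the end', priority: 'high' },
  { issue: 'Response is too short', suggestion: 'Expand the response with more detail', priority: 'medium' }
] );
// ordered priorities: high, medium, low
```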

Multi-Dimensional Evaluation

Use the dimensions field to nest EvaluationResult instances for sub-scores. Each dimension should use the name field to identify it.

import { evaluator, z, EvaluationStringResult, EvaluationNumberResult } from '@outputai/core';

export const evaluateMultiDimensional = evaluator( {
  name: 'evaluate_multi_dimensional',
  description: 'Evaluate content across multiple quality dimensions',
  inputSchema: z.string(),
  fn: async response => {
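    // calculateCoherence and calculateRelevance are assumed to be local helpers
    // (not shown here) that each return a score between 0 and 1.
    const coherenceScore = calculateCoherence( response );
    const relevanceScore = calculateRelevance( response );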
    const overallScore = ( coherenceScore + relevanceScore ) / 2;

    return new EvaluationStringResult( {
      value: overallScore > 0.7 ? 'high_quality' : 'low_quality',
      confidence: 0.9,
      dimensions: [
        new EvaluationNumberResult( {
          value: coherenceScore,
          confidence: 0.85,
          name: 'coherence'
        } ),
        new EvaluationNumberResult( {
          value: relevanceScore,
          confidence: 0.88,
          name: 'relevance'
        } )
      ]
    } );
  }
} );
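
The aggregation step in the example above (average the dimension scores, then apply a 0.7 threshold) can be isolated as a small pure function. This is a sketch under assumptions: `DimensionSketch` and `summarizeDimensions` are names introduced here, scores are taken to lie between 0 and 1, and reporting the weakest dimension is an addition for illustration.

```typescript
// Local stand-in for a named numeric dimension result.
interface DimensionSketch {
  name: string;
  value: number;      // assumed to be a 0-1 score
  confidence: number;
}

// Averages the dimension values, labels the result using the same 0.7
// threshold as the example, and reports the lowest-scoring dimension.
function summarizeDimensions( dimensions: DimensionSketch[], threshold = 0.7 ) {
  if ( dimensions.length === 0 ) {
    throw new Error( 'At least one dimension is required' );
  }
  const overall = dimensions.reduce( ( sum, d ) => sum + d.value, 0 ) / dimensions.length;
  const weakest = dimensions.reduce( ( min, d ) => ( d.value < min.value ? d : min ) );
  return {
    overall,
    label: overall > threshold ? 'high_quality' : 'low_quality',
    weakest: weakest.name
  };
}

const summary = summarizeDimensions( [
  { name: 'coherence', value: 0.75, confidence: 0.85 },
  { name: 'relevance', value: 0.5, confidence: 0.88 }
] );
// summary: { overall: 0.625, label: 'low_quality', weakest: 'relevance' }
```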

Complete Example

Based on a real workflow evaluator file:

import { evaluator, z, EvaluationBooleanResult, EvaluationNumberResult } from '@outputai/core';
import { generateText, Output } from '@outputai/llm';
import { blogContentSchema } from './types.js';
import type { BlogContent } from './types.js';

// Simple boolean evaluator
export const evaluateMinimumLength = evaluator( {
  name: 'evaluate_minimum_length',
  description: 'Check if blog content meets minimum length requirements',
  inputSchema: blogContentSchema,
  fn: async ( input: BlogContent ) => {
    const MIN_TOKENS = 500;
    const meetsRequirement = input.tokenCount >= MIN_TOKENS;

    return new EvaluationBooleanResult( {
      value: meetsRequirement,
      confidence: 1.0,
      reasoning: `Content has ${input.tokenCount} tokens (minimum: ${MIN_TOKENS})`
    } );
  }
} );

// LLM-powered number evaluator
export const evaluateSignalToNoise = evaluator( {
  name: 'evaluate_signal_to_noise',
  description: 'Evaluate the signal-to-noise ratio of blog content',
  inputSchema: blogContentSchema,
  fn: async ( input: BlogContent ) => {
    const { output } = await generateText( {
      prompt: 'signal_noise@v1',
      variables: {
        title: input.title,
        content: input.content
      },
      output: Output.object( {
        schema: z.object( {
          score: z.number().describe( 'Signal-to-noise score 0-100' )
        } )
      } )
    } );

    return new EvaluationNumberResult( {
      value: output.score,
      confidence: 0.85
    } );
  }
} );

// LLM-powered boolean evaluator
export const evaluateRelevance = evaluator( {
  name: 'evaluate_relevance',
  description: 'Check if content is relevant to the stated topic',
  inputSchema: z.object( {
    content: z.string(),
    topic: z.string(),
    keywords: z.array( z.string() )
  } ),
  fn: async ( { content, topic, keywords } ) => {
    const { output } = await generateText( {
      prompt: 'relevance_check@v1',
      variables: { content, topic, keywords: keywords.join( ', ' ) },
      output: Output.object( {
        schema: z.object( {
          isRelevant: z.boolean(),
          relevanceScore: z.number().describe( 'Relevance score 0-1' ),
          explanation: z.string()
        } )
      } )
    } );

    return new EvaluationBooleanResult( {
      value: output.isRelevant,
      confidence: output.relevanceScore,
      reasoning: output.explanation
    } );
  }
} );

Best Practices

1. Use Appropriate Result Types

// Boolean for pass/fail decisions
return new EvaluationBooleanResult( { value: true, confidence: 0.9 } );

// Number for scores and ratings
return new EvaluationNumberResult( { value: 85, confidence: 0.85 } );

// String for categories, labels, or classifications
return new EvaluationStringResult( { value: 'positive', confidence: 0.9 } );

2. Provide Meaningful Confidence Scores

// High confidence for deterministic checks
confidence: 1.0  // e.g., length checks, pattern matching

// Medium confidence for heuristic-based evaluations
confidence: 0.85  // e.g., LLM-based assessments

// Lower confidence for uncertain evaluations
confidence: 0.7  // e.g., subjective quality judgments

3. Include Reasoning for Transparency

return new EvaluationBooleanResult( {
  value: false,
  confidence: 0.95,
  reasoning: `Content contains ${errorCount} grammatical errors, exceeding threshold of ${maxErrors}`
} );

4. Keep Evaluators Focused

// Good - single responsibility
export const evaluateGrammar = evaluator( { ... } );
export const evaluateReadability = evaluator( { ... } );
export const evaluateTone = evaluator( { ... } );

// Avoid - doing too much in one evaluator
export const evaluateEverything = evaluator( { ... } );

5. Use Descriptive Names

// Good - clear what is being evaluated
name: 'evaluate_content_originality'
name: 'evaluate_factual_accuracy'
name: 'evaluate_sentiment_alignment'

// Avoid - vague names
name: 'check'
name: 'validate'
name: 'evaluate_stuff'

6. Use Feedback for Actionable Improvements

feedback: [
  new EvaluationFeedback( {
    issue: 'Missing conclusion paragraph',
    suggestion: 'Add a summary paragraph at the end',
    priority: 'high'
  } )
]

7. Use Dimensions for Multi-Criteria Evaluation

dimensions: [
  new EvaluationNumberResult( { value: 8, confidence: 0.9, name: 'coherence' } ),
  new EvaluationNumberResult( { value: 6, confidence: 0.85, name: 'relevance' } )
]

Verification Checklist

Related Skills

output-dev-workflow-function - Orchestrating evaluators in workflow.ts
output-dev-step-function - Creating step functions
output-dev-types-file - Defining evaluator input schemas
output-dev-prompt-file - Creating prompt files for LLM-powered evaluators
output-dev-folder-structure - Understanding project layout
output-eval-error-analysis - Identify what to evaluate before writing evaluators
