Skill

benchmark

Run benchmark suites and manage policy evolution: create challengers, compare them against champions, and promote or roll back policies.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/autodialectics:benchmark

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Use this skill to benchmark policies and drive champion/challenger evolution.

SKILL.md

56 lines · ~490 tokens

Stats

Parent stars: 0

Maintenance: Good

Last Commit: Apr 12, 2026

Autodialectics Benchmark & Evolve

Use this skill to benchmark policies and drive champion/challenger evolution.

Prerequisites

autodialectics-mcp must be on PATH; install it with pip install autodialectics.

MCP Workflow

Benchmarking

benchmark(suite_dir?, policy_id?) — run the benchmark suite against a policy. Returns case-by-case results with scores and decisions.

Policy Evolution

evolve_policy(use_gepa?) — analyze recent benchmark reports and create a challenger policy. Set use_gepa: false to skip the GEPA optimizer (simpler heuristic fallback).
promote_policy(policy_id) — promote a challenger to champion if comparison rules allow.
rollback_policy() — revert to the previous champion if the current one regresses.

CLI Fallback

autodialectics benchmark
autodialectics evolve
autodialectics promote <policy_id>
autodialectics rollback
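
For scripted use, the CLI fallback commands above can be assembled as argument lists before being passed to subprocess.run. This is a minimal sketch, assuming the four subcommands take exactly the arguments listed above; the helper names build_cli_command and run_cli are hypothetical, and no additional flags are assumed.

```python
import subprocess

# Subcommands listed in the CLI fallback; only `promote` takes an argument there.
_ACTIONS = {"benchmark", "evolve", "promote", "rollback"}


def build_cli_command(action, policy_id=None):
    """Return the argv list for one autodialectics CLI fallback command."""
    if action not in _ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    if action == "promote":
        if not policy_id:
            raise ValueError("promote requires a policy_id")
        return ["autodialectics", "promote", policy_id]
    if policy_id is not None:
        # Assumption: the other subcommands are used exactly as listed.
        raise ValueError(f"{action} takes no policy_id in the CLI fallback")
    return ["autodialectics", action]


def run_cli(action, policy_id=None):
    """Run the command and return its CompletedProcess (needs the CLI on PATH)."""
    return subprocess.run(
        build_cli_command(action, policy_id),
        capture_output=True,
        text=True,
        check=False,
    )
```

Building the argument list separately keeps the command construction checkable without the CLI installed, and avoids shell quoting issues with policy IDs.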

Typical Evolution Cycle

benchmark → evolve_policy → benchmark (with challenger) → compare → promote or rollback

1. Run benchmarks with the current champion to establish a baseline.
2. Evolve a challenger from the benchmark reports.
3. Run the same benchmarks with the challenger's policy ID.
4. Compare results. Promote if the challenger wins; roll back if it regresses.
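
The steps above can be sketched as a small driver. This is an illustration under stated assumptions, not the tool's implementation: the three callables stand in for the MCP tools and are supplied by the caller, and the response keys (status, policy_id, mean_overall_score) and the "promoted" status value are hypothetical. Only the no_reports status is taken from this document. Comparing on mean overall score alone is also an assumption; the real comparison rules live in the tool.

```python
def run_evolution_cycle(benchmark, evolve_policy, promote_policy):
    """Run one champion/challenger cycle and return a short outcome record.

    The three arguments are callables standing in for the MCP tools.
    All dictionary keys read below are hypothetical response fields.
    """
    # Step 1: baseline with the current champion (no policy_id given).
    baseline = benchmark()

    # Step 2: evolve a challenger from the benchmark reports.
    evolved = evolve_policy()
    if evolved.get("status") == "no_reports":
        return {"outcome": "no_reports", "challenger": None}
    challenger_id = evolved["policy_id"]

    # Step 3: same suite, challenger's policy ID.
    candidate = benchmark(policy_id=challenger_id)

    # Step 4: compare on the same suite; promote only if the challenger wins.
    if candidate["mean_overall_score"] <= baseline["mean_overall_score"]:
        return {"outcome": "kept_champion", "challenger": challenger_id}

    promotion = promote_policy(challenger_id)
    # Promotion can still be denied by the tool's own comparison rules.
    if promotion.get("status") != "promoted":
        return {"outcome": "promotion_denied", "challenger": challenger_id}
    return {"outcome": "promoted", "challenger": challenger_id}
```

Rollback is left out of the sketch: it would apply after a promotion, if the newly promoted champion later regresses on the same suite.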

Guidance

Never claim a policy is better without benchmark evidence from the same suite.
When reporting benchmark results, include: total cases, pass/fail/revise counts, mean overall score, mean slop composite.
If evolve_policy returns no_reports, run benchmarks first to generate data.
Promotion can be denied by comparison rules — check the response status.
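
The reporting fields listed above can be computed from case-level results with a short aggregation. This is a sketch only: the per-case keys decision, overall_score, and slop_composite, and the lowercase decision values pass, fail, and revise, are assumptions about the result shape rather than the tool's documented schema.

```python
from collections import Counter
from statistics import mean


def summarize_cases(cases):
    """Aggregate case-level results into the fields the guidance asks for."""
    if not cases:
        raise ValueError("no benchmark cases to summarize")
    # Count how many cases ended in each decision.
    decisions = Counter(case["decision"] for case in cases)
    return {
        "total_cases": len(cases),
        "pass": decisions.get("pass", 0),
        "fail": decisions.get("fail", 0),
        "revise": decisions.get("revise", 0),
        "mean_overall_score": mean(case["overall_score"] for case in cases),
        "mean_slop_composite": mean(case["slop_composite"] for case in cases),
    }
```

Summaries for a champion and a challenger are only comparable when both were produced from the same suite.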

Arguments

If the user passes a suite directory after /autodialectics:benchmark, use it as the benchmark suite path.
