eval-guide

Name: eval-guide
Author: microsoft


Plan, generate, and triage evaluations for Microsoft Copilot Studio agents. Start from an agent description to populate an eval planning workbook, then generate importable test sets (CSV + .docx manifest). After running evals, produce structured SHIP/ITERATE/BLOCK triage reports with failure root-cause diagnosis and actionable fixes, all grounded in Microsoft's Eval Scenario Library and Triage & Improvement Playbook.

testing

ai-ml

npx claudepluginhub microsoft/eval-guide


What's Inside

Skills (5)

eval-faq

/eval-faq

Answers AI agent evaluation methodology questions with practical, opinionated guidance grounded primarily in Microsoft's agent evaluation ecosystem (MS Learn, Eval Scenario Library, Triage & Improvement Playbook, Eval Guidance Kit) supplemented by select industry sources.

eval-generator

/eval-generator

Generate standalone — turns the populated Eval Suite Planning workbook (output of `/eval-suite-planner`) into concrete capability eval sets and trust & safety eval sets. Delivers playbook Steps 2 & 3 and designs the Step 8 regression partition. Outputs 2-column Copilot Studio `-for-import.csv` files (Question + Expected response only), a customer-ready `.docx` manifest report, and an `eval-setup-guide.docx` for assigning testing methods per row in Copilot Studio's Evaluate tab. Use after planning, before running.

eval-result-interpreter

/eval-result-interpreter

Analyzes Copilot Studio evaluation results using Practical Guidance on Agent Evaluation's 10-step playbook (Steps 6, 7, and 9) plus Microsoft's triage diagnostics. Returns a gate-based SHIP / ITERATE / BLOCK verdict with root cause classification, remediation, and pattern analysis.

eval-suite-planner

/eval-suite-planner

Plan standalone — populates the Eval Suite Planning & Logging Template from an Agent Vision or plain-English agent description. Grounded in Practical Guidance on Agent Evaluation v5: Step 1 planning, Steps 2-3 eval-set decomposition, Step 4 gates/improvement targets, Step 5 human inputs, Step 6 grader-validation planning, Step 7 baseline placeholders, Step 8 regression partitioning, and Step 10 reusable-asset candidates. Output is a template-preserving `.xlsx` workbook plus an interactive HTML review page. Use before generating test cases or running evals.

eval-triage-and-improvement

/eval-triage-and-improvement

Use this skill when the user's Copilot Studio agent evaluations have come back and they need to interpret scores, diagnose root causes of underperforming test cases, find remediation steps, or analyze patterns to improve their agent. Always use this skill when the user mentions: "eval failed", "why did this fail", "triage", "diagnose failure", "low pass rate", "fix evaluation results", "not passing", "failing test cases", "evaluation results", "improve my eval scores", or any situation where eval scores need interpretation and action.
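The `/eval-generator` description above mentions 2-column `-for-import.csv` files (Question + Expected response only). A minimal sketch of what writing such a file could look like follows. The header strings, the UTF-8 encoding, the `write_import_csv` helper, and the sample cases are all assumptions for illustration, not the plugin's actual implementation.

```python
import csv
from pathlib import Path

# Column names follow the eval-generator description; the exact header
# strings Copilot Studio expects on import are an assumption here.
HEADERS = ["Question", "Expected response"]


def write_import_csv(cases, path):
    """Write (question, expected_response) pairs to a 2-column CSV file."""
    path = Path(path)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(HEADERS)
        for question, expected in cases:
            # Skip incomplete rows rather than emitting blank cells.
            if question.strip() and expected.strip():
                writer.writerow([question.strip(), expected.strip()])
    return path


# Hypothetical test cases for an imaginary HR-policy agent.
sample_cases = [
    ("How many vacation days do new hires get?", "New hires accrue 15 days per year."),
    ("Can I carry over unused days?", "Yes, up to 5 days, per the leave policy."),
]
```

Calling `write_import_csv(sample_cases, "capability-for-import.csv")` would produce a header row plus one row per case. The `csv` module quotes any cell that contains a comma, so expected responses can be full sentences.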

Stats

Version: 1.0.0

Language: HTML

Stars: 14

Forks: 5

Maintenance: Excellent

License: MIT

Last Commit: Jun 24, 2026

Added: Mar 30, 2026


README

eval-guide

AI agent evaluation toolkit for Copilot Studio. Plan evals, generate test cases, interpret results, and triage failures — from Claude Code or GitHub Copilot.

Grounded in Microsoft's Eval Scenario Library, Triage & Improvement Playbook, Common Evaluation Approaches, and MS Learn agent evaluation documentation.

Install

Claude Code

claude plugin marketplace add microsoft/eval-guide
claude plugin install eval-guide@eval-guide

GitHub Copilot

npx skills add microsoft/eval-guide

Skills

Skill	Command	What it does
Eval Guide	`/eval-guide`	Full eval lifecycle — discover, plan, generate, run, interpret. Start here.
Eval Suite Planner	`/eval-suite-planner`	Populated Eval Suite Template workbook plus an interactive HTML review page for eval sets, methods, gates, human inputs, and grader-validation notes
Eval Generator	`/eval-generator`	Test cases for single-response and conversation (multi-turn) evaluation modes
Eval Result Interpreter	`/eval-result-interpreter`	SHIP / ITERATE / BLOCK verdict with root cause classification
Eval Triage & Improvement	`/eval-triage-and-improvement`	Interactive diagnosis and remediation for failing evals
Eval FAQ	`/eval-faq`	Methodology questions answered from Microsoft's eval ecosystem

Quick start

> /eval-guide

Tell me about your agent — what does it do, who uses it, and what does "good" look like?

Works the same in both Claude Code and GitHub Copilot.

The toolkit walks you through five operational stages that map onto the 10-step playbook in Microsoft's Practical Guidance on Agent Evaluation (the canonical methodology spine, in skills/eval-guide/playbook.md):

Stage	What happens	Playbook steps	Works without a running agent?
0. Discover	Articulate what the agent does, what success looks like, the eval objective, the agent's risk tier (5 factors), and the owner	Step 1	Yes
1. Plan	Scope eval depth by agent architecture; plan capability vs trust & safety eval sets; set pass-rate targets and hard/soft gates; specify human inputs + source→ground-truth map	Steps 1, 4, 5	Yes
2. Generate & Baseline	Produce capability and trust & safety test-case CSVs (single-response) or conversation blueprints (multi-turn) importable into Copilot Studio; design the regression partition	Steps 2, 3, 8 (design)	Yes
3. Run	Execute the baseline against a live agent	Step 6	Needs running agent
4. Interpret & Improve	Triage results, classify each failure (eval-setup vs agent-quality), gate-based verdict, design the optimization loop, flag reusable assets	Steps 7, 9, 10	Needs eval results

Stages 0-2 work from just an agent description — no running agent required.
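The stage table mentions pass-rate targets, hard/soft gates, and a gate-based SHIP / ITERATE / BLOCK verdict. One plausible reading of how those pieces combine is sketched below. The mapping (a missed hard gate blocks, a missed soft gate means iterate) is an assumption, and `GateResult` and `verdict` are hypothetical names rather than part of the toolkit.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    name: str          # eval set the gate applies to
    pass_rate: float   # observed pass rate, 0.0 to 1.0
    target: float      # pass-rate target set during planning
    hard: bool         # True for a hard gate, False for a soft gate


def verdict(gates):
    """Return 'BLOCK', 'ITERATE', or 'SHIP' under the assumed mapping."""
    missed = [g for g in gates if g.pass_rate < g.target]
    if any(g.hard for g in missed):
        return "BLOCK"      # a missed hard gate stops the release
    if missed:
        return "ITERATE"    # only soft gates missed: improve and re-run
    return "SHIP"


# Hypothetical results: the hard gate is met, the soft gate is not.
results = [
    GateResult("trust-and-safety", pass_rate=1.00, target=1.00, hard=True),
    GateResult("capability", pass_rate=0.82, target=0.90, hard=False),
]
print(verdict(results))  # prints ITERATE
```

Checking hard gates first means a single hard-gate miss cannot be masked by strong results elsewhere, which matches the usual intent of a hard gate.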

Interactive dashboard review

Each stage with a review step (0, 1, 2, and 4) generates an interactive HTML dashboard that opens locally in your browser. You review, edit inline, and confirm before the AI proceeds, so there is no back-and-forth in chat to fix test cases.

Stage complete → Dashboard opens → You review & edit → Confirm → Final artifacts generated

Stage	What you review in the dashboard	What you can edit
0. Discover	Agent Vision (purpose, users, knowledge, capabilities, boundaries, success criteria)	All fields inline, add/remove list items
1. Plan	Populated Eval Suite Template workbook plus HTML review page	Edit workbook cells without changing template structure; use the page to review summary, filters, TBDs, and checklist
2. Generate	Test cases per eval set	Edit expected responses, questions, methods, add/remove cases
4. Interpret	Verdict, failure triage, root causes, actions	Reclassify root causes, add comments

Final deliverables (.docx reports, .csv test sets) are only generated after you confirm via the dashboard.

The dashboard is a standalone HTML file generated by skills/eval-guide/dashboard/serve.py (zero dependencies) and opened directly in your browser, with no server required. Your edits and feedback auto-save to localStorage as you go, so your work is preserved if the browser closes.
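To illustrate the general pattern of a standalone HTML file with localStorage auto-save, here is a small standard-library sketch. It is not the contents of serve.py: the template, the storage key, and the `build_dashboard` and `open_dashboard` helpers are all hypothetical.

```python
import html
import json
import tempfile
import webbrowser
from pathlib import Path
from string import Template

# Hypothetical template; the real dashboard's markup and storage key are
# not described in this README.
TEMPLATE = """<!doctype html>
<html><head><meta charset="utf-8"><title>$title</title></head>
<body>
<h1>$title</h1>
<textarea id="editor" rows="20" cols="80"></textarea>
<script>
const KEY = $key;      // localStorage key (assumed naming scheme)
const initial = $data; // stage output embedded at build time
const box = document.getElementById("editor");
// Restore saved edits if present, otherwise show the generated content.
box.value = localStorage.getItem(KEY) ?? JSON.stringify(initial, null, 2);
// Auto-save on every edit so work survives a closed browser.
box.addEventListener("input", () => localStorage.setItem(KEY, box.value));
</script>
</body></html>
"""


def _js_literal(value):
    """Serialize to JSON and keep '</script>' from ending the script tag early."""
    return json.dumps(value).replace("</", "<\\/")


def build_dashboard(title, data, out_dir=None):
    """Write a self-contained dashboard page and return its path."""
    out_dir = Path(out_dir or tempfile.mkdtemp())
    page = Template(TEMPLATE).substitute(
        title=html.escape(title),
        key=_js_literal("eval-dashboard:" + title),
        data=_js_literal(data),
    )
    out_path = out_dir / "dashboard.html"
    out_path.write_text(page, encoding="utf-8")
    return out_path


def open_dashboard(path):
    """Open the generated file in the default browser; no server involved."""
    webbrowser.open(Path(path).resolve().as_uri())
```

A call such as `open_dashboard(build_dashboard("Stage 2: Generate", {"cases": []}))` would write the page to a temporary directory and open it from a `file://` URL. How browsers scope localStorage for `file://` pages varies, so persistence in this sketch depends on the browser.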

Architecture-aware eval scoping

The toolkit automatically scopes evaluation depth based on your agent's architecture:

View full README on GitHub


Similar Plugins

evalview

evaluate-agent

evaluation

agentic-usability

More by microsoft

webwright

agt-governance

azure-sdk-java

azure-sdk-rust

azure-sdk-python
