By microsoft
Plan, generate, and triage evaluations for Microsoft Copilot Studio agents. Start from an agent description to populate an eval planning workbook, then generate importable test sets (CSV + .docx manifest). After running evals, produce structured SHIP/ITERATE/BLOCK triage reports with failure root-cause diagnosis and actionable fixes, all grounded in Microsoft's Eval Scenario Library and Triage & Improvement Playbook.
Answers AI agent evaluation methodology questions with practical, opinionated guidance grounded primarily in Microsoft's agent evaluation ecosystem (MS Learn, Eval Scenario Library, Triage & Improvement Playbook, Eval Guidance Kit) supplemented by select industry sources.
Generate standalone — turns the populated Eval Suite Planning workbook (output of `/eval-suite-planner`) into concrete capability eval sets and trust & safety eval sets. Delivers playbook Steps 2 & 3 and designs the Step 8 regression partition. Outputs 2-column Copilot Studio `-for-import.csv` files (Question + Expected response only), a customer-ready `.docx` manifest report, and an `eval-setup-guide.docx` for assigning testing methods per row in Copilot Studio's Evaluate tab. Use after planning, before running.
Analyzes Copilot Studio evaluation results using Practical Guidance on Agent Evaluation's 10-step playbook (Steps 6, 7, and 9) plus Microsoft's triage diagnostics. Returns a gate-based SHIP / ITERATE / BLOCK verdict with root cause classification, remediation, and pattern analysis.
Plan standalone — populates the Eval Suite Planning & Logging Template from an Agent Vision or plain-English agent description. Grounded in Practical Guidance on Agent Evaluation v5: Step 1 planning, Steps 2-3 eval-set decomposition, Step 4 gates/improvement targets, Step 5 human inputs, Step 6 grader-validation planning, Step 7 baseline placeholders, Step 8 regression partitioning, and Step 10 reusable-asset candidates. Output is a template-preserving `.xlsx` workbook plus an interactive HTML review page. Use before generating test cases or running evals.
Use this skill when the user's Copilot Studio agent evaluations have come back and they need to interpret scores, diagnose root causes of underperforming test cases, find remediation steps, or analyze patterns to improve their agent. Always use this skill when the user mentions: "eval failed", "why did this fail", "triage", "diagnose failure", "low pass rate", "fix evaluation results", "not passing", "failing test cases", "evaluation results", "improve my eval scores", or any situation where eval scores need interpretation and action.
AI agent evaluation toolkit for Copilot Studio. Plan evals, generate test cases, interpret results, and triage failures — from Claude Code or GitHub Copilot.
Grounded in Microsoft's Eval Scenario Library, Triage & Improvement Playbook, Common Evaluation Approaches, and MS Learn agent evaluation documentation.
claude plugin marketplace add microsoft/eval-guide
claude plugin install eval-guide@eval-guide
npx skills add microsoft/eval-guide
| Skill | Command | What it does |
|---|---|---|
| Eval Guide | /eval-guide | Full eval lifecycle — discover, plan, generate, run, interpret. Start here. |
| Eval Suite Planner | /eval-suite-planner | Populated Eval Suite Template workbook plus an interactive HTML review page for eval sets, methods, gates, human inputs, and grader-validation notes |
| Eval Generator | /eval-generator | Test cases for single-response and conversation (multi-turn) evaluation modes |
| Eval Result Interpreter | /eval-result-interpreter | SHIP / ITERATE / BLOCK verdict with root cause classification |
| Eval Triage & Improvement | /eval-triage-and-improvement | Interactive diagnosis and remediation for failing evals |
| Eval FAQ | /eval-faq | Methodology questions answered from Microsoft's eval ecosystem |
> /eval-guide
Tell me about your agent — what does it do, who uses it, and what does "good" look like?
Works the same in both Claude Code and GitHub Copilot.
The toolkit walks you through five operational stages that map onto the 10-step playbook from Microsoft's Practical Guidance on Agent Evaluation (the canonical methodology spine, in skills/eval-guide/playbook.md):
| Stage | What happens | Playbook steps | Works without a running agent? |
|---|---|---|---|
| 0. Discover | Articulate what the agent does, what success looks like, the eval objective, the agent's risk tier (5 factors), and the owner | Step 1 | Yes |
| 1. Plan | Scope eval depth by agent architecture; plan capability vs trust & safety eval sets; set pass-rate targets and hard/soft gates; specify human inputs + source→ground-truth map | Steps 1, 4, 5 | Yes |
| 2. Generate & Baseline | Produce capability and trust & safety test-case CSVs (single-response) or conversation blueprints (multi-turn) importable into Copilot Studio; design the regression partition | Steps 2, 3, 8 (design) | Yes |
| 3. Run | Execute the baseline against a live agent | Step 6 | Needs running agent |
| 4. Interpret & Improve | Triage results, classify each failure (eval-setup vs agent-quality), gate-based verdict, design the optimization loop, flag reusable assets | Steps 7, 9, 10 | Needs eval results |
Stages 0-2 work from just an agent description — no running agent required.
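Stage 1 sets pass-rate targets with hard and soft gates, and Stage 4 returns a gate-based verdict. As a rough sketch of how the two could fit together, the Python below assumes that a missed hard gate means BLOCK, a missed soft gate means ITERATE, and anything else means SHIP. That mapping, the `Gate` class, the eval-set names, and the target values are all illustrative assumptions, not the toolkit's actual logic.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Gate:
    """Hypothetical gate record; field names are illustrative only."""
    eval_set: str   # name of the eval set this gate applies to
    target: float   # pass-rate target between 0.0 and 1.0
    hard: bool      # True for a hard gate, False for a soft gate


def pass_rate(results: List[bool]) -> float:
    """Fraction of passing test cases; an empty set counts as 0.0."""
    return sum(results) / len(results) if results else 0.0


def verdict(gates: List[Gate], results: Dict[str, List[bool]]) -> str:
    """Assumed mapping: hard miss -> BLOCK, soft miss -> ITERATE, else SHIP."""
    missed_hard = False
    missed_soft = False
    for gate in gates:
        if pass_rate(results.get(gate.eval_set, [])) < gate.target:
            if gate.hard:
                missed_hard = True
            else:
                missed_soft = True
    if missed_hard:
        return "BLOCK"
    if missed_soft:
        return "ITERATE"
    return "SHIP"


# Made-up gates and results, purely for illustration.
gates = [
    Gate("trust-and-safety", target=1.0, hard=True),
    Gate("capability", target=0.8, hard=False),
]
results = {
    "trust-and-safety": [True, True, True],
    "capability": [True, True, False, False],
}
print(verdict(gates, results))  # ITERATE
```

Here the trust & safety set meets its hard gate, but the capability set passes only 2 of 4 cases against a 0.8 soft target, so the sketch returns ITERATE.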
Each stage generates an interactive HTML dashboard that opens locally in your browser. You review, edit inline, and confirm before the AI proceeds, so there is no back-and-forth in chat to fix test cases.
Stage complete → Dashboard opens → You review & edit → Confirm → Final artifacts generated
| Stage | What you review in the dashboard | What you can edit |
|---|---|---|
| 0. Discover | Agent Vision (purpose, users, knowledge, capabilities, boundaries, success criteria) | All fields inline, add/remove list items |
| 1. Plan | Populated Eval Suite Template workbook plus HTML review page | Edit workbook cells without changing template structure; use the page to review summary, filters, TBDs, and checklist |
| 2. Generate | Test cases per eval set | Edit expected responses, questions, methods, add/remove cases |
| 4. Interpret | Verdict, failure triage, root causes, actions | Reclassify root causes, add comments |
Final deliverables (.docx reports, .csv test sets) are only generated after you confirm via the dashboard.
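For a sense of what a two-column `-for-import.csv` test set might look like, here is a minimal sketch using Python's standard `csv` module. The exact header strings are an assumption based on the "Question + Expected response" description, and the test cases and the `to_import_csv` helper are hypothetical. Check Copilot Studio's import requirements before relying on this layout.

```python
import csv
import io
from typing import Iterable, Tuple

# Assumed header text, taken from the "Question + Expected response" description.
HEADER = ["Question", "Expected response"]


def to_import_csv(cases: Iterable[Tuple[str, str]]) -> str:
    """Serialize (question, expected response) pairs as two-column CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)  # quotes fields that contain commas or quotes
    writer.writerow(HEADER)
    for question, expected in cases:
        writer.writerow([question, expected])
    return buf.getvalue()


# Made-up test cases for a hypothetical returns-policy agent.
cases = [
    ("What is the refund window?", "30 days from delivery."),
    ("Can I return opened items, and how?", "Yes, within 30 days with a receipt."),
]

csv_text = to_import_csv(cases)
print(csv_text)
```

To save the result to disk, write `csv_text` with `open(path, "w", newline="", encoding="utf-8")` so the line endings produced by the `csv` module are preserved.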
The dashboard is a standalone HTML file generated by skills/eval-guide/dashboard/serve.py (zero dependencies) and opened directly in your browser — no server required. Feedback auto-saves as you edit via localStorage — if the browser closes, your work is preserved.
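The standalone-file approach can be illustrated with a short sketch. This is not the contents of `serve.py`; it only shows the general pattern, under the assumption that review data is embedded in the page as JSON and that a small script mirrors edits into `localStorage`. The `render_dashboard` function, the storage key, and the page layout are hypothetical.

```python
import html
import json
from string import Template

# Hypothetical page template; $title and $data are filled in by render_dashboard.
PAGE = Template("""<!doctype html>
<meta charset="utf-8">
<title>$title</title>
<h1>$title</h1>
<textarea id="notes" rows="12" cols="80"></textarea>
<script>
  // Review data is embedded in the page, so nothing has to be fetched.
  const data = $data;
  const key = "eval-dashboard-notes";  // hypothetical storage key
  const box = document.getElementById("notes");
  // Restore earlier edits if present, otherwise show the embedded data.
  box.value = localStorage.getItem(key) ?? JSON.stringify(data, null, 2);
  // Save on every edit so work survives a closed tab.
  box.addEventListener("input", () => localStorage.setItem(key, box.value));
</script>
""")


def render_dashboard(title: str, data: dict) -> str:
    """Return a self-contained HTML page with the review data inlined."""
    # Escape "</" so embedded strings cannot close the script element early.
    payload = json.dumps(data).replace("</", "<\\/")
    return PAGE.substitute(title=html.escape(title), data=payload)


page = render_dashboard(
    "Stage 2: Generate",
    {"cases": [{"question": "What is the refund window?",
                "expected": "30 days from delivery."}]},
)
print(len(page) > 0)  # True
```

Writing `page` to a file with `Path.write_text` and calling `webbrowser.open(path.resolve().as_uri())` would open it without a server. Only standard-library modules are used, in keeping with the zero-dependency description.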
The toolkit automatically scopes evaluation depth based on your agent's architecture: