From agent-skills
Builds rigorous LLM evaluation pipelines with golden datasets, metrics, and automated evaluators to ensure AI feature quality and prevent regressions.
Slash command
/agent-skills:eval-driven-development
Eval-Driven Development checks that AI features behave consistently and predictably by running them against a golden dataset and scoring the outputs with automated evaluators.
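As a rough illustration of that loop, the sketch below runs a feature over a tiny golden dataset and scores each output with a simple automated evaluator. Everything named here (`GoldenCase`, `contains_evaluator`, `run_evals`, `fake_feature`, and the two sample cases) is hypothetical and not part of the skill itself; a real pipeline would replace `fake_feature` with the actual model call and would likely use richer evaluators.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    """One golden-dataset entry (hypothetical shape): a prompt plus a required substring."""
    case_id: str
    prompt: str
    must_contain: str


def contains_evaluator(case: GoldenCase, output: str) -> bool:
    """Automated evaluator: case-insensitive check that the expected text appears."""
    return case.must_contain.lower() in output.lower()


def run_evals(cases: list[GoldenCase], feature: Callable[[str], str]) -> dict[str, bool]:
    """Run the feature on every golden case and record pass/fail per case id."""
    return {c.case_id: contains_evaluator(c, feature(c.prompt)) for c in cases}


def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases that passed; 0.0 for an empty result set."""
    return sum(results.values()) / len(results) if results else 0.0


def fake_feature(prompt: str) -> str:
    """Stand-in for the real LLM call (assumption: canned answers for the demo)."""
    canned = {
        "What is the capital of France?": "The capital is Paris.",
        "What is 2 + 2?": "I am not sure.",
    }
    return canned.get(prompt, "")


# Illustrative golden dataset, not real project data.
GOLDEN = [
    GoldenCase("capital", "What is the capital of France?", "paris"),
    GoldenCase("arith", "What is 2 + 2?", "4"),
]

results = run_evals(GOLDEN, fake_feature)
```

Keeping per-case results rather than only the aggregate pass rate makes it possible to see which specific cases fail after a change.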
| Rationalization | Why It Is Wrong |
|---|---|
| "Manual spot checks are enough." | Spot checks miss regressions across prompts, model versions, and edge cases. |
| "We can add evals after launch." | Without a baseline, you cannot tell whether a later prompt or model change improved behavior. |
| "The judge model says it is good." | LLM judges need criteria, calibration examples, and failure review before they are trustworthy. |
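The points in the table can be made concrete with a small sketch: comparing a candidate run against a stored baseline to surface regressions, and measuring how often an LLM judge agrees with human labels before trusting it. The function names and all sample data below are hypothetical, chosen only for illustration.

```python
def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Case ids that passed in the baseline but fail (or are missing) in the candidate."""
    return sorted(
        cid for cid, passed in baseline.items()
        if passed and not candidate.get(cid, False)
    )


def find_fixes(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Case ids that pass in the candidate but did not pass in the baseline."""
    return sorted(
        cid for cid, passed in candidate.items()
        if passed and not baseline.get(cid, False)
    )


def judge_agreement(human: dict[str, bool], judge: dict[str, bool]) -> float:
    """Fraction of human-labeled cases where the judge verdict matches the human label."""
    if not human:
        return 0.0
    matches = sum(1 for cid, label in human.items() if judge.get(cid) == label)
    return matches / len(human)


# Hypothetical per-case results from two runs over the same golden dataset.
baseline = {"refund-policy": True, "greeting": True, "edge-empty-input": False}
candidate = {"refund-policy": True, "greeting": False, "edge-empty-input": True}

regressions = find_regressions(baseline, candidate)
fixes = find_fixes(baseline, candidate)

# Hypothetical calibration set: human labels versus LLM-judge verdicts.
human_labels = {"a": True, "b": False, "c": True, "d": True}
judge_verdicts = {"a": True, "b": True, "c": True, "d": True}
agreement = judge_agreement(human_labels, judge_verdicts)
```

In this made-up example both runs pass two of three cases, so the aggregate pass rate is unchanged even though one case regressed; only the per-case comparison against the baseline reveals it.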
Before finishing, confirm:
npx claudepluginhub ishandutta2007/awesome-agent-skills
Guides building evals before prompts for LLM features, agents, or prompts. Helps measure improvement objectively and avoid speculative iteration.
Designs, tests, compares, versions, and validates prompts or LLM behavior using measurable criteria and datasets. Useful when evaluating prompt quality, edge cases, and deployment readiness.
Generates 20 test cases (15 happy path + 5 edge) for AI features in spreadsheet format using PM-Friendly Evals. Launches simple eval workflow with optional Linear project.