From grimoire
Builds structured evaluation suites for LLM and AI system performance using reproducible metrics. Use when testing model quality, prompt changes, or regression detection.
Slash command
/grimoire:write-eval-suite
Build a structured evaluation suite that measures LLM or AI system performance with reproducible, comparable metrics.
Adopted by: OpenAI (public Evals framework), Stanford (HELM, Holistic Evaluation of Language Models), EleutherAI (LM Evaluation Harness)
Impact: HELM evaluates 30+ models across 42 scenarios and 7 metric categories; OpenAI uses community evals to discover model regressions before release. Systematic evals caught GPT-4 Turbo regressions not visible to internal red-teaming.
Why best: Evals are to AI systems what unit tests are to software: they make quality measurable, regressions detectable, and improvements verifiable. Without them, "the model got better" is a belief, not a fact. A good eval suite is the single most durable investment in a production AI system.
lm-evaluation-harness, or a custom runner. Each eval: input → model call → output → scoring function → metric aggregation.
output.strip() == expected. Model-graded: structured prompt asking judge model to rate 1-5 with reasoning. Code eval: execute output, check return value or stdout.
Eval structure (OpenAI Evals format):
{"input": [{"role": "user", "content": "Summarize: [article]"}], "ideal": "The article discusses..."}
{"input": [{"role": "user", "content": "Extract the date from: [text]"}], "ideal": "2026-03-15"}
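The two records above follow a one-JSON-object-per-line layout with an `input` message list and an `ideal` answer. As a minimal sketch, assuming that layout, loading such records and applying the `output.strip() == expected` check could look like the following. The names `load_samples` and `exact_match` are hypothetical helpers for illustration, not part of the OpenAI Evals API.

```python
import json

# Sample records in the same shape as the examples above (one JSON object per line).
JSONL_TEXT = """\
{"input": [{"role": "user", "content": "Summarize: [article]"}], "ideal": "The article discusses..."}
{"input": [{"role": "user", "content": "Extract the date from: [text]"}], "ideal": "2026-03-15"}
"""


def load_samples(text):
    """Parse JSONL text into a list of dicts, checking the required keys (hypothetical helper)."""
    samples = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        missing = {"input", "ideal"} - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        samples.append(record)
    return samples


def exact_match(output, expected):
    """Return 1.0 if the output equals the expected answer after trimming whitespace, else 0.0."""
    return 1.0 if output.strip() == expected else 0.0


samples = load_samples(JSONL_TEXT)
print(len(samples))
print(exact_match(" 2026-03-15\n", samples[1]["ideal"]))
```

Exact match suits short, unambiguous targets such as the date extraction record; free-form targets such as the summary record would usually need the model-graded approach instead.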
Scoring pipeline:
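for example in eval_dataset:
    # input -> model call -> output
    output = model.complete(example["input"])
    # scoring function: the judge compares the output with the reference answer
    score = judge_model.grade(output, example["ideal"])
    # metric aggregation
    metrics.record(score)
print(f"Mean score: {metrics.mean():.3f} ± {metrics.ci():.3f}")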
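The loop above leaves `model`, `judge_model`, and `metrics` undefined. The sketch below fills them in with stand-ins so the pipeline can run end to end: `Metrics`, `stub_model`, and `grade` are hypothetical names, the model is a canned stub rather than a real API call, and an exact-match grader is swapped in for the judge model. The interval is a normal-approximation 95% half-width, which is one possible reading of `metrics.ci()`.

```python
import math
import statistics


class Metrics:
    """Collects per-example scores and summarizes them (hypothetical helper)."""

    def __init__(self):
        self.scores = []

    def record(self, score):
        self.scores.append(float(score))

    def mean(self):
        return statistics.fmean(self.scores)

    def ci(self, z=1.96):
        """Half-width of a normal-approximation 95% interval on the mean."""
        if len(self.scores) < 2:
            return float("nan")  # not enough data to estimate spread
        return z * statistics.stdev(self.scores) / math.sqrt(len(self.scores))


def stub_model(messages):
    """Stand-in for a real model call: returns a canned answer per prompt."""
    canned = {"Extract the date from: [text]": "2026-03-15 "}
    return canned.get(messages[-1]["content"], "")


def grade(output, ideal):
    """Exact-match grader used here in place of a judge model."""
    return 1.0 if output.strip() == ideal else 0.0


eval_dataset = [
    {"input": [{"role": "user", "content": "Extract the date from: [text]"}],
     "ideal": "2026-03-15"},
    {"input": [{"role": "user", "content": "Summarize: [article]"}],
     "ideal": "The article discusses..."},
]

metrics = Metrics()
for example in eval_dataset:
    output = stub_model(example["input"])
    metrics.record(grade(output, example["ideal"]))

print(f"Mean score: {metrics.mean():.3f} ± {metrics.ci():.3f}")
```

With only two examples the interval is nearly as wide as the score range itself, which illustrates why a suite needs enough examples before two runs can be compared meaningfully.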
npx claudepluginhub jeffreytse/grimoire --plugin grimoire
Builds rigorous LLM evaluation pipelines with golden datasets, metrics, and automated evaluators to ensure AI feature quality and prevent regressions.
Use this skill when the user asks to "design an eval suite", "build evals for my AI feature", "create an evaluation framework", "how do I evaluate my AI", "what evals should I run", "build an eval system", or wants to create a systematic evaluation framework for an AI-powered product feature. Typically run after error-analysis has identified the failure categories to prioritize.