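Self-evaluates output quality: produces a 1-5 scorecard across five dimensions (accuracy, completeness, clarity, actionability, conciseness) with specific improvement suggestions. Suited to a reflection step after complex code or design work.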
Slash command: /everything-claude-code:agent-self-evaluation
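After completing a complex task, the agent pauses and evaluates its own output against a structured 5-axis rubric. This is not a pass/fail gate. It is a deliberate reflection step meant to catch omissions, flag overconfidence, and surface areas for improvement before the user finds them.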
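(references/hook-integration.md)

| Dimension | Question | What it catches |
|---|---|---|
| Accuracy | Are the facts, claims, and outputs correct? | Hallucinations, wrong API names, wrong syntax, false statements |
| Completeness | Does it cover everything the user asked for? | Missed edge cases, unhandled error paths, forgotten requirements, skipped subtasks |
| Clarity | Is the explanation easy to follow and well structured? | Muddled explanations, undefined terms, missing context, verbosity |
| Actionability | Can the user act on the output immediately? | Vague advice, missing steps, saying "you should X" without showing how, no verification path |
| Conciseness | Does it use the fewest words/tokens needed to complete the task? | Redundancy, over-explanation, restating the user's question verbatim, filler |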
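- 5 (Exceptional): no reasonable room for improvement
- 4 (Good): only minor issues, no substantive gaps
- 3 (Adequate): satisfies the request, but at least one dimension has a clear weakness
- 2 (Weak): clear gaps that affect usability or correctness
- 1 (Poor): fundamentally misses the request or contains major errors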
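Every score below 5 must cite specific evidence. A 3 cannot just say "could be better"; it must state exactly what is missing or wrong. Rule of thumb: "Show the gap, don't just name it."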
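The scale and the evidence rule above can be checked mechanically. The following is a minimal Python sketch, not part of the skill itself: the `AxisScore` record and its field names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass

AXES = ("Accuracy", "Completeness", "Clarity", "Actionability", "Conciseness")


@dataclass(frozen=True)
class AxisScore:
    """One scorecard row (hypothetical structure, for illustration only)."""

    axis: str
    score: int
    evidence: str = ""

    def __post_init__(self) -> None:
        if self.axis not in AXES:
            raise ValueError(f"unknown axis: {self.axis!r}")
        if not 1 <= self.score <= 5:
            raise ValueError(f"score must be 1-5, got {self.score}")
        # Rule from the text: every score below 5 must cite specific evidence.
        if self.score < 5 and not self.evidence.strip():
            raise ValueError(f"{self.axis}: a score of {self.score} needs evidence")
```

Constructing `AxisScore("Clarity", 3)` with no evidence raises `ValueError`. Note that this sketch only enforces the stated rule for scores below 5; it does not judge whether the evidence is any good.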
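Gather what will be evaluated:
- The original user request (re-read it from the conversation)
- Your final response/output (the deliverable)
- Any tool output that verifies correctness (test results, exit codes, lint output)
- User feedback received during the task (corrections, "try again", "that's not right")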
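Work through the 5 dimensions one at a time. For each dimension:
Do not settle on an overall average in your head first and then back-fill the individual scores. Score each dimension afresh and independently.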
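Use the template in templates/evaluation-report.md. The report must include:
- A one-line summary
- A 5-axis scorecard (score plus evidence for each axis)
- An overall score (simple average, rounded to 1 decimal place)
- 1-3 specific improvements, ordered by impact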
- Self-check: "Would the user agree with this assessment?"
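The required report parts and the overall-score arithmetic can be sketched in Python. This is an illustration under assumptions: the function names and the plain-text layout are hypothetical, and the real layout is whatever templates/evaluation-report.md defines.

```python
from statistics import mean

AXES = ("Accuracy", "Completeness", "Clarity", "Actionability", "Conciseness")


def overall_score(scores: dict[str, int]) -> float:
    """Simple (unweighted) average of the five axis scores, to 1 decimal place."""
    return round(mean(scores[axis] for axis in AXES), 1)


def render_report(
    summary: str,
    scores: dict[str, int],
    evidence: dict[str, str],
    improvements: list[str],
) -> str:
    """Assemble the required report parts as plain text (layout is assumed)."""
    if not 1 <= len(improvements) <= 3:
        raise ValueError("expected 1-3 improvements, ordered by impact")
    lines = [f"Summary: {summary}", "Scorecard:"]
    for axis in AXES:
        lines.append(f"  {axis}: {scores[axis]} ({evidence[axis]})")
    lines.append(f"Overall: {overall_score(scores)}")
    lines.append("Improvements:")
    lines.extend(f"  {n}. {text}" for n, text in enumerate(improvements, start=1))
    lines.append('Self-check: "Would the user agree with this assessment?"')
    return "\n".join(lines)
```

With scores of 5, 4, 5, 5, 4 the sum is 23 and `overall_score` returns 4.6; with 2, 3, 4, 2, 3 the sum is 14 and it returns 2.8. Because five integer scores always average to a multiple of 0.2, rounding to one decimal never loses information here.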
If any dimension scores ≤ 3:
Task: Add retry logic to HTTP client
Scorecard:
Accuracy: 5 — All API calls correct. Verified: retries use
exponential backoff. No hallucinated methods.
Completeness: 4 — Covered happy path + 3 error cases. Missing:
timeout handling for hung connections.
Clarity: 5 — Code comments explain backoff formula.
PR description links to incident that motivated this.
Actionability:5 — Single merge. No follow-up tasks. Tests pass.
Conciseness: 4 — 47 lines total. The retry loop could be extracted
into a helper to drop ~8 lines.
Overall: 4.6 — One gap (timeout handling). Fix before merging.
Task: Add retry logic to HTTP client
Scorecard:
Accuracy: 2 — Used urllib3 which doesn't match our
httpx-based codebase. Wrong library.
Completeness: 3 — Works for GET. POST/PUT not handled (user
said "all HTTP requests").
Clarity: 4 — Code is readable. Good variable names.
Actionability:2 — "Add tests" mentioned but no test file created.
User has to write tests before merging.
Conciseness: 3 — 120 lines. The retry config is duplicated in
3 places instead of one shared RetryConfig object.
Overall: 2.8 — Wrong library used. Needs httpx rewrite.
Fix accuracy first (switch to httpx), then extend to all
HTTP methods, then consolidate config.
FAIL: Accuracy: 5 — All good.
Completeness: 5 — Everything covered.
Clarity: 5 — Clear.
No evidence is cited. This is self-praise, not evaluation. A genuine 5 requires showing that nothing can be improved.
FAIL: Completeness: 2 — Didn't handle WebSocket connections or
gRPC streaming (user didn't ask for these)
Evaluate only against what the user actually asked for, not against what you could have built in addition.
FAIL: "As I said earlier, this approach is wrong. Score: 1"
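The evaluation targets the delivered output; it is not a chance to re-argue design decisions that have already been made. If the approach was wrong, that should have been caught before delivery.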
FAIL: "Score: 3. I don't like Python decorators."
"Don't like" is not evidence. Cite a specific readability, testability, or correctness problem; otherwise keep the score at 4 or above.
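Two of the failure modes above (evidence-free scores and taste presented as evidence) lend themselves to a rough automated check. The sketch below is a heuristic, not part of the skill: the function name, the word-count threshold, and the phrase list are all hypothetical and would need tuning. Scope creep and re-arguing design decisions still require judgment.

```python
import re

# Hypothetical threshold and phrases; tune them for your own reports.
MIN_EVIDENCE_WORDS = 4
SUBJECTIVE_PHRASES = ("don't like", "dislike", "could be better")


def lint_evidence(axis: str, score: int, evidence: str) -> list[str]:
    """Return warnings for evidence that reads like praise or taste, not proof."""
    warnings = []
    words = re.findall(r"[\w']+", evidence)
    if len(words) < MIN_EVIDENCE_WORDS:
        warnings.append(f"{axis}: evidence too thin to support a {score}")
    lowered = evidence.lower()
    for phrase in SUBJECTIVE_PHRASES:
        if phrase in lowered:
            warnings.append(f"{axis}: '{phrase}' is an opinion, not evidence")
    return warnings
```

For example, `lint_evidence("Accuracy", 5, "All good.")` returns one warning about thin evidence, and `lint_evidence("Clarity", 3, "I don't like Python decorators.")` returns one warning about opinion.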
- agent-eval: compares different coding agents on benchmark tasks
- verification-loop: systematically verifies output against expected results
- security-review: security-focused code review checklist

npx claudepluginhub aaione/everything-claude-code-zh