agent-self-evaluation | everything-claude-code


Agent Self-Evaluation

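After completing a complex task, the agent pauses and scores its own output against a structured 5-axis rubric. This is not a pass/fail gate but a deliberate reflection step: it catches omissions, flags overconfidence, and surfaces areas for improvement before the user finds them.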

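When to Activate

- After writing code that spans 3+ files or 50+ lines
- After completing a multi-step workflow (implement → test → review)
- After a debugging session that took 3+ attempts
- After producing a design document, architecture decision, or written analysis
- When the user asks “how good was that?” or “rate yourself”
- When any session Stop hook fires at session end (if configured; see references/hook-integration.md)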
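
The triggers above can be read as a simple "any of these" check. The following is a minimal sketch, assuming the agent keeps some per-task bookkeeping; `TaskStats`, its field names, and `should_self_evaluate` are hypothetical and are not part of this skill's files.

```python
from dataclasses import dataclass


@dataclass
class TaskStats:
    """Hypothetical per-task bookkeeping; every field name is illustrative."""
    files_changed: int = 0
    lines_written: int = 0
    multi_step_workflow: bool = False      # e.g. implement -> test -> review
    debug_attempts: int = 0
    produced_written_artifact: bool = False  # design doc, ADR, written analysis
    user_asked_for_rating: bool = False
    stop_hook_fired: bool = False


def should_self_evaluate(stats: TaskStats) -> bool:
    """Return True if at least one activation trigger applies."""
    return any([
        stats.files_changed >= 3 or stats.lines_written >= 50,
        stats.multi_step_workflow,
        stats.debug_attempts >= 3,
        stats.produced_written_artifact,
        stats.user_asked_for_rating,
        stats.stop_hook_fired,
    ])


# A small one-file fix does not trigger; a 3-file change does.
print(should_self_evaluate(TaskStats(files_changed=1, lines_written=12)))  # False
print(should_self_evaluate(TaskStats(files_changed=3)))                    # True
```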

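Core Concepts

The 5 Evaluation Axes

Axis	Question	What it catches
Accuracy	Are the facts, claims, and outputs correct?	Hallucinations, wrong API names, incorrect syntax, false statements
Completeness	Does it cover everything the user asked for?	Missed edge cases, unhandled error paths, forgotten requirements, skipped subtasks
Clarity	Is the explanation easy to follow and well structured?	Confusing explanations, undefined terms, missing context, verbosity
Actionability	Can the user act on the output immediately?	Vague advice, missing steps, saying “you should X” without showing how, no verification path
Conciseness	Does it use the fewest words/tokens needed to get the job done?	Redundancy, over-explanation, repeating the user's question verbatim, filler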

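Scoring Scale

5 - Exceptional: no reasonable room for improvement
4 - Good: minor issues only, no substantive gaps
3 - Adequate: meets the request, but at least one axis shows a clear weakness
2 - Weak: clear gaps that hurt usability or correctness
1 - Poor: fundamentally misses the request or contains major errors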

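Evidence Rule

Every score below 5 must cite specific evidence. A 3 cannot just say “could be better”; it must state exactly what is missing or wrong. Rule of thumb: “Show the gap, don't just name it.”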
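
One way to make the evidence rule hard to skip is to enforce it when a score is recorded. The sketch below is an illustration only: `AxisScore`, `AXES`, and the `VAGUE_PHRASES` list are hypothetical names, and the vague-phrase check is a crude heuristic rather than anything this skill prescribes.

```python
from dataclasses import dataclass

AXES = ("Accuracy", "Completeness", "Clarity", "Actionability", "Conciseness")
# Hypothetical examples of "naming" a gap without showing it.
VAGUE_PHRASES = ("could be better", "needs work", "not great")


@dataclass(frozen=True)
class AxisScore:
    axis: str
    score: int
    evidence: str = ""

    def __post_init__(self) -> None:
        if self.axis not in AXES:
            raise ValueError(f"unknown axis: {self.axis!r}")
        if not 1 <= self.score <= 5:
            raise ValueError("score must be between 1 and 5")
        # Normalize so "Could be better." matches "could be better".
        text = self.evidence.strip().lower().rstrip(".")
        if self.score < 5 and (not text or text in VAGUE_PHRASES):
            raise ValueError(f"{self.axis}: a score below 5 must cite a specific gap")


# Accepted: the gap is shown, not just named.
AxisScore("Completeness", 4, "Missing timeout handling for hung connections.")

# Rejected: names a weakness without saying what is missing.
try:
    AxisScore("Clarity", 3, "Could be better.")
except ValueError as err:
    print(err)
```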

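Workflow

Step 1: Gather the Raw Material

Collect what is being evaluated:

- The original user request (re-read it from the conversation)
- Your final response/output (the deliverable)
- Any tool output that verifies correctness (test results, exit codes, lint output)
- User feedback received during the task (corrections, “try again”, “that's not right”)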

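Step 2: Score Each Axis Independently

Work through the 5 axes one at a time. For each axis:

- Read the axis question
- Look for evidence (or missing evidence) in the output
- Assign a score from 1 to 5
- If the score is < 5, write a one-sentence improvement note that cites the gap

Do not settle on an overall score in your head and then back-fill the individual axes. Score each axis fresh, on its own.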

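Step 3: Generate the Evaluation Report

Use the template in templates/evaluation-report.md. The report must include:

- A one-line summary
- A 5-axis scorecard (score + evidence for each axis)
- An overall score (simple average, rounded to 1 decimal place)
- 1-3 specific improvements, ordered by impact
- Self-check: "Would the user agree with this assessment?"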
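
The overall score is plain arithmetic, so it is easy to compute rather than estimate. The sketch below assumes a scorecard shaped as `{axis: (score, evidence)}`; the function names and the plain-text layout are illustrative only and do not reproduce the contents of templates/evaluation-report.md, which are not shown here.

```python
from statistics import mean

AXES = ("Accuracy", "Completeness", "Clarity", "Actionability", "Conciseness")


def overall_score(scores: dict[str, int]) -> float:
    """Simple average across the five axes, rounded to one decimal place."""
    missing = [axis for axis in AXES if axis not in scores]
    if missing:
        raise ValueError(f"missing axes: {missing}")
    return round(mean([scores[axis] for axis in AXES]), 1)


def render_report(summary: str,
                  scorecard: dict[str, tuple[int, str]],
                  improvements: list[str]) -> str:
    """Assemble the required report parts (hypothetical layout)."""
    if not 1 <= len(improvements) <= 3:
        raise ValueError("list 1-3 improvements, ordered by impact")
    overall = overall_score({axis: score for axis, (score, _) in scorecard.items()})
    lines = [f"Summary: {summary}", "", "Scorecard:"]
    for axis in AXES:
        score, evidence = scorecard[axis]
        lines.append(f"  {axis}: {score} - {evidence}")
    lines += ["", f"Overall: {overall}", "", "Improvements:"]
    lines += [f"  {i}. {text}" for i, text in enumerate(improvements, 1)]
    lines += ["", "Self-check: Would the user agree with this assessment?"]
    return "\n".join(lines)


# Five integer scores always average to a multiple of 0.2,
# so rounding to one decimal never has to break a tie.
print(overall_score({"Accuracy": 5, "Completeness": 4, "Clarity": 5,
                     "Actionability": 5, "Conciseness": 4}))  # 4.6
```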

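Step 4: Apply Improvements

If any axis scores ≤ 3:

- State what you would do differently
- If the gap can be fixed in < 30 seconds (a missing link, unclear wording), fix it now
- If the gap requires rework, flag it explicitly: “This axis scored [score] because [evidence]. Re-running with [specific fix] would likely raise it to [new score].”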
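
The Step 4 decision can be sketched as a small triage function. Everything here is an assumption made for illustration: the `Gap` type, its `est_fix_seconds` field (the agent's own rough estimate), and the `triage` function are hypothetical, and the rework message is only modeled on the flag sentence in the last bullet.

```python
from dataclasses import dataclass

QUICK_FIX_SECONDS = 30  # threshold from Step 4


@dataclass
class Gap:
    """Hypothetical record of one weak axis."""
    axis: str
    score: int
    evidence: str
    fix: str
    est_fix_seconds: int  # the agent's own estimate, not a measured value
    expected_score: int


def triage(gaps: list[Gap]) -> dict[str, list[str]]:
    """Split axes scoring <= 3 into quick fixes and rework flags."""
    plan: dict[str, list[str]] = {"fix_now": [], "flag_for_rework": []}
    for gap in gaps:
        if gap.score > 3:
            continue  # Step 4 only applies to axes at 3 or below
        if gap.est_fix_seconds < QUICK_FIX_SECONDS:
            plan["fix_now"].append(f"{gap.axis}: {gap.fix}")
        else:
            plan["flag_for_rework"].append(
                f"{gap.axis} scored {gap.score} because {gap.evidence}. "
                f"Re-running with {gap.fix} would likely raise it to {gap.expected_score}."
            )
    return plan


plan = triage([
    Gap("Clarity", 3, "the PR description has no incident link",
        "the incident link added", est_fix_seconds=10, expected_score=4),
    Gap("Accuracy", 2, "urllib3 was used in an httpx-based codebase",
        "an httpx rewrite", est_fix_seconds=1800, expected_score=4),
])
print(plan["fix_now"])
print(plan["flag_for_rework"])
```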

Code Examples

Example: Good Evaluation (scores of 4+)

Task: Add retry logic to HTTP client

Scorecard:
  Accuracy:    5 — All API calls correct. Verified: retries use
                  exponential backoff. No hallucinated methods.
  Completeness: 4 — Covered happy path + 3 error cases. Missing:
                  timeout handling for hung connections.
  Clarity:      5 — Code comments explain backoff formula.
                  PR description links to incident that motivated this.
  Actionability:5 — Single merge. No follow-up tasks. Tests pass.
  Conciseness:  4 — 47 lines total. The retry loop could be extracted
                  into a helper to drop ~8 lines.

Overall: 4.6 — One gap (timeout handling). Fix before merging.

Example: Weak Evaluation (scores of 2-3)

Task: Add retry logic to HTTP client

Scorecard:
  Accuracy:    2 — Used urllib3 which doesn't match our
                  httpx-based codebase. Wrong library.
  Completeness: 3 — Works for GET. POST/PUT not handled (user
                  said "all HTTP requests").
  Clarity:      4 — Code is readable. Good variable names.
  Actionability:2 — "Add tests" mentioned but no test file created.
                  User has to write tests before merging.
  Conciseness:  3 — 120 lines. The retry config is duplicated in
                  3 places instead of one shared RetryConfig object.

Overall: 2.8 — Wrong library used. Needs httpx rewrite.
  Fix accuracy first (switch to httpx), then extend to all
  HTTP methods, then consolidate config.

Anti-Patterns

“Everything is a 5”

FAIL: Accuracy:    5 — All good.
   Completeness: 5 — Everything covered.
   Clarity:      5 — Clear.

No evidence is cited. This is self-congratulation, not evaluation. A genuine 5 has to show that nothing is left to improve.
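
A rough lint for this anti-pattern is to flag any 5 whose justification is only a word or two. The sketch below is a heuristic under stated assumptions: `unsupported_fives` and the `MIN_EVIDENCE_WORDS` threshold are hypothetical, and word count is only a proxy for whether real evidence was cited.

```python
MIN_EVIDENCE_WORDS = 4  # hypothetical threshold; adjust as needed


def unsupported_fives(scorecard: dict[str, tuple[int, str]]) -> list[str]:
    """Return axes scored 5 whose justification is too thin to count as evidence."""
    return [
        axis
        for axis, (score, evidence) in scorecard.items()
        if score == 5 and len(evidence.split()) < MIN_EVIDENCE_WORDS
    ]


flagged = unsupported_fives({
    "Accuracy": (5, "All good."),
    "Completeness": (5, "Everything covered."),
    "Clarity": (5, "Code comments explain the backoff formula."),
})
print(flagged)  # ['Accuracy', 'Completeness']
```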

Over-Penalizing for Scope Creep

FAIL: Completeness: 2 — Didn't handle WebSocket connections or
   gRPC streaming (user didn't ask for these)

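Evaluate only against what the user actually asked for, not against what you could have built on top.

Re-Litigating Through the Evaluation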

FAIL: "As I said earlier, this approach is wrong. Score: 1"

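The evaluation covers the delivered output; it is not a venue for re-arguing design decisions that have already been made. If the approach was wrong, that should have been caught before delivery.

Treating Personal Preference as an Objective Gap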

FAIL: "Score: 3. I don't like Python decorators."

“Don't like” is not evidence. Cite a specific readability, testability, or correctness problem; otherwise keep the score at 4+.

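Best Practices

- Evaluate the output, not the process. The user cares about what you delivered, not how many rounds you iterated.
- One improvement per weak axis. Do not list 5 things for a single axis; pick the gap with the biggest impact.
- Tie improvements to user impact. “Missing error handling means the user's API call will crash silently” beats “add error handling”.
- Be explicit about what counts as fixed. “Re-run with httpx transport configured for retries” beats “fix the library issue”.
- Use tool output as evidence. If the tests pass, cite that; if lint is clean, cite that. Don't guess; grep for evidence.
- If you can't find a gap, look harder. Perfect scores on all five axes are rare. Ask yourself: “If I were the user, what about this output would annoy me?”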
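
For the "use tool output as evidence" practice, one option is to capture a command's exit code and last output line and quote them directly in the scorecard. This is a minimal sketch: `collect_evidence` is a hypothetical helper, and the command shown is a stand-in for whatever test or lint command the project actually uses.

```python
import subprocess
import sys


def collect_evidence(label: str, command: list[str]) -> str:
    """Run a check and return a one-line, quotable evidence string."""
    result = subprocess.run(command, capture_output=True, text=True)
    status = "passed" if result.returncode == 0 else "failed"
    output = (result.stdout or result.stderr).strip().splitlines()
    last_line = output[-1] if output else "(no output)"
    return f"{label} {status} (exit code {result.returncode}): {last_line}"


# Stand-in command; in practice this would be the project's own test or lint run.
print(collect_evidence("smoke check", [sys.executable, "-c", "print('2 passed')"]))
```

Quoting the returned string in the evidence column keeps the score tied to something the user can re-run.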

Related Skills

- agent-eval: compares different coding agents on benchmark tasks
- verification-loop: systematically verifies output against expected results
- security-review: security-focused code review checklist