From agentic-usability
Analyzes SDK benchmark results to identify failure patterns, documentation gaps, and API design issues. Use when reviewing evaluation runs or improving SDK usability.
Slash command
/agentic-usability:insights [project-directory]
You are acting as an SDK usability analyst. Your task is to analyze benchmark results and help the developer understand where their SDK is lacking and what improvements would have the biggest impact.
Results are at results/<runId>/<target>/<testId>/:
| File | Content |
|---|---|
| judge.json | Scores: apiDiscovery, callCorrectness, completeness, functionalCorrectness (0-100), overallVerdict, notes |
| generated-solution.json | Agent's solution [{path, content}] |
| agent-notes.md | Agent's first-person account of confusion, failed attempts, gotchas |
| agent-output.log | Raw agent stdout/stderr |
| agent-session.jsonl | Full agent conversation log |
| agent-egress.log.json | Network traffic (what URLs the agent accessed) |
| judge-session.jsonl | Judge conversation log |
| judge-egress.log.json | Judge network traffic |
| workspace-snapshot.tar.gz | Full sandbox state |
The test suite with reference solutions is at suite.json in the project root.
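To make the layout above concrete, here is a minimal Python sketch that walks results/<runId>/<target>/ and averages the judge scores. It assumes each judge.json is a flat object with the four score fields as numbers and overallVerdict as a boolean; the real schema may nest these differently. The function names `load_judgements` and `summarize` and the `passRate` key are hypothetical and are not part of the agentic-usability CLI.

```python
import json
from pathlib import Path
from statistics import mean

# Score fields listed for judge.json in the table above.
SCORE_KEYS = ("apiDiscovery", "callCorrectness", "completeness", "functionalCorrectness")


def load_judgements(results_root, run_id, target):
    """Return {testId: parsed judge.json} for one run and target (hypothetical helper)."""
    base = Path(results_root) / run_id / target
    judgements = {}
    for judge_file in sorted(base.glob("*/judge.json")):
        # The parent directory name is the testId.
        judgements[judge_file.parent.name] = json.loads(
            judge_file.read_text(encoding="utf-8")
        )
    return judgements


def summarize(judgements):
    """Mean of each score plus the pass rate.

    Assumption: scores are top-level numbers and overallVerdict is a boolean.
    """
    if not judgements:
        return {}
    summary = {key: mean(j[key] for j in judgements.values()) for key in SCORE_KEYS}
    passed = sum(1 for j in judgements.values() if j["overallVerdict"])
    summary["passRate"] = passed / len(judgements)
    return summary
```

Calling `summarize(load_judgements("results", run_id, target))` with a real run ID and target name would give one row of averages, which is a quick way to see which score dimension drags a target down before reading individual agent-notes.md files.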
overallVerdict can be true even when apiDiscovery is low: the agent may have taken a different but working approach.
The following prompt contains all benchmark results, aggregate stats, and analysis instructions:
!agentic-usability insights --prompt-only -p $ARGUMENTS
npx claudepluginhub pspdfkit-labs/agentic-usability --plugin agentic-usability