From safe-browser
Audits how well a product, SDK, docs site, or SKILL.md works when AI agents must onboard from scratch using only a short prompt. Spawns subagents to discover docs, install deps, and attempt real tasks, then scores Setup Friction, Speed, Efficiency, Error Recovery, and Doc Quality with an A–F grade HTML report.
How this skill is triggered — by the user, by Claude, or both
Slash command
/safe-browser:agent-experienceThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Evaluate how well a product/SDK/docs surface works when an AI agent actually tries to onboard and do a realistic task — **starting from a short one-sentence prompt**, with nothing pasted in. The agent must find the docs, install what it needs, and attempt real work. That's the only honest test of agent DX.
Evaluate how well a product/SDK/docs surface works when an AI agent actually tries to onboard and do a realistic task — starting from a short one-sentence prompt, with nothing pasted in. The agent must find the docs, install what it needs, and attempt real work. That's the only honest test of agent DX.
The skill spawns multiple subagents in parallel, captures each one's tool-call trace, and scores the experience using the same dimensions as the Skill Test Arena dashboard: Setup Friction, Speed, Efficiency, Error Recovery, Doc Quality.
Do not spoonfeed. The subagent gets a tiny prompt like "Get started with {product} and {do its primary thing}". It must discover the docs, choose the path, and hit real failures. A good doc survives this; a bad doc does not.
Execute these steps in order. Do not skip ahead.
Resolve what the user is asking to audit. The target may arrive in one of three forms:
https://docs.example.com). This is the seed the subagents start from.AskUserQuestion for the URL or repo.This skill is product-agnostic. Never assume what the user wants to audit. Do not infer a target from environment signals (operator's email domain, git remote, repo name, recent files, memory, CLAUDE.md). Even if context strongly suggests a particular company, the user-facing question must NOT pre-fill or default to any specific product, URL, or company name. Ask open-endedly with neutral options only: e.g., "Paste a URL", "Paste a local path", "Type a product name". If the user did not name a target in their invocation, ask them — start fresh, no priors.
Research lightly after the user has named a target. 1 WebFetch max, enough to confirm: what is this product, and does it have a getting-started guide? You're identifying that there is a flow to follow, not extracting the steps. The whole point is to let the docs dictate the path.
Define ONE abstract goal, not a step-by-step checklist. The goal should be at the level of "complete the onboarding" or "make the product do its primary thing once" — NOT a list of specific actions.
Why: prescriptive checklists steer agents. If you tell them "navigate to example.com" but the docs' quickstart navigates to a different URL, the agent is torn between your instruction and the docs. That pollutes the test.
Examples of good abstract goals (the target product is supplied by the user — the examples below are illustrative only, not defaults):
Examples of BAD goals (too prescriptive — don't do this):
The subagent will self-report against the abstract goal: did I complete the onboarding as the docs described? (yes / no / partial). The concrete sub-outcomes the agent actually achieved live in their trace under primary_outcome_achieved, not in a pre-defined checklist.
If the target has no clear getting-started flow (rare — even a README is a flow), ask the user what "done" means before continuing.
Use AskUserQuestion in a single call with 4 questions. Options: max 4 per question.
Test depth (single-select, header: "Depth"):
5 agents (Recommended) — balanced coverage3 agents — quick sanity check10 agents — thorough, higher costProgramming languages (multiSelect, header: "Languages"): pick up to 4 — Python, TypeScript, Go, Shell/Bash (let user deselect).
Personas (multiSelect, header: "Personas"):
Standard (Recommended) — neutral baseline, no behavioral flavoring. Just "do the task." Best for unbiased measurement.Pragmatic — just get it working, fastest pathThorough — read the docs end-to-end before codingSkeptical — verify claims the docs makeExecution mode (single-select, header: "Exec mode"):
Allow Bash (Recommended) — subagents can run npm install, curl, etc. on your machine. Most realistic.Draft-only — subagents may fetch docs and write code but won't execute anything. Safer.After the user answers, gather one more question about model choice:
"Model"):
Sonnet (Recommended) — balanced cost/quality, default for most auditsOpus — strongest reasoning, highest cost; good for dense/ambiguous docsHaiku — cheapest, fastest; good for checking if docs are agent-friendly to smaller modelsMixed comparison — split agents across Opus + Sonnet + Haiku so you can see how doc quality varies by model size. Useful for "are my docs robust even to weaker models?"Pass the chosen model to each Agent invocation via the model parameter. If Mixed, distribute N agents roughly equally across the 3 models (round-robin by slot index) and record which model each agent used in the trace + report.
After the user answers, you have: depth (N), languages[], personas[], exec_mode, model.
If exec_mode = "Allow Bash", follow up with a second AskUserQuestion asking about credentials:
"Credentials"):
Auto-discover (Recommended) — skill checks the user's env vars, common dotfiles, and credential managers; only prompts for paste if nothing found. Best for repeat use and for cases where another operator is running the audit.None — let agents block (friction test) — agents hit the credential wall, counts as Setup Friction. Best for pure docs audits.Paste manually — you paste keys directly; skill injects them. Use when you don't have keys stored locally yet.If user picks Auto-discover, run Step 2.5 below before continuing. If Paste manually (or auto-discover falls back), AskUserQuestion asks for the credential values — not the names. The skill then writes them to each workspace .env using generic, product-agnostic names:
API_KEYPROJECT_IDSECRETDo NOT use product-specific names like BROWSERBASE_API_KEY, EXA_API_KEY, STRIPE_SECRET_KEY. Those names steer the agent — they see BROWSERBASE_API_KEY in env and skip ever reading the docs to find out what env var the SDK actually expects. The generic name forces them to:
BROWSERBASE_API_KEY).API_KEY value into whatever form the SDK requires — either re-export (export BROWSERBASE_API_KEY=$API_KEY) or pass inline in code (new Browserbase({ apiKey: process.env.API_KEY })).If an agent fails to figure out the mapping, that's a doc quality signal — the docs weren't clear about credential naming.
Auto-discover)Run a tiered lookup. Stop at the first tier that produces a usable candidate. Never print credential values to chat — only names and source paths. The user picks by name; the skill internally maps name → value → workspace .env.
Derive the product slug from the target URL/repo to bias toward relevant matches. e.g. https://docs.browserbase.com → slug browserbase. Use lowercase substring match (case-insensitive) when ranking candidates.
Tier 1 — Already-exported env vars (free, zero side effects):
printenv | grep -iE '^[A-Z][A-Z0-9_]*_(API_KEY|TOKEN|SECRET|KEY)=' | cut -d= -f1
This returns names only. If any names contain the product slug, those are top candidates.
Tier 2 — Narrow dotfile scan (a hardcoded short list, NOT a recursive grep):
grep -hE '^[[:space:]]*export[[:space:]]+[A-Z][A-Z0-9_]*_(API_KEY|TOKEN|SECRET|KEY)=' \
~/.zshrc ~/.bashrc ~/.bash_profile ~/.zprofile ~/.env ./.env ./.envrc 2>/dev/null \
| sed -E 's/^[[:space:]]*export[[:space:]]+([A-Z0-9_]+)=.*/\1/' \
| sort -u
Files allowed: ~/.zshrc, ~/.bashrc, ~/.bash_profile, ~/.zprofile, ~/.env, ./.env, ./.envrc. Do NOT expand this list. Do NOT recurse. Do NOT scan ~/Library, ~/.config/, ~/Documents, etc. This is the entire allowlist; anything else is out of scope and risks leaking unrelated secrets.
For each match, record (NAME, source_path). Read the value lazily — only when the user has confirmed the choice — by re-grepping the specific source file for that exact name.
Tier 3 — Credential manager (only if op or security is on PATH AND tiers 1–2 had no good match):
op account list exits 0 (i.e. user is signed in). Don't trigger an interactive auth flow inside the skill.security find-generic-password -l "<expected-name>" -w — try once with the most likely name (e.g. BROWSERBASE_API_KEY); silent failure means not stored.If a credential manager produces hits, list them as candidates the same way as tiers 1–2.
Tier 4 — Fallback to paste: If all tiers above produced zero candidates, fall through to the manual paste flow described in Step 2.
Presenting candidates to the user. After tiers 1–3:
Using BROWSERBASE_API_KEY from ~/.zshrc. (Name + source only — never the value.)"Use which credential?") with up to 4 options:
<NAME> (from <source>)Paste manually instead escape hatchShow all option that re-asks with the rest.Reading the value. Once the user has confirmed a choice (or it was auto-selected), read the value:
printenv <NAME> (capture stdout, do not echo).export <NAME>= line and parse the RHS, stripping surrounding quotes.op read "op://<vault>/<item>/<field>" or security find-generic-password -l <NAME> -w.Write the value into per-agent workspace .env files using the same generic names (API_KEY, PROJECT_ID, SECRET) as the paste flow — see Step 2. The discovery layer is upstream of injection; downstream behavior (generic names, agent must read docs to map them) is unchanged.
Orchestrator-retained credentials. After writing per-agent .env files, the orchestrator keeps the original product-specific names → values (e.g. BROWSERBASE_API_KEY) available to itself for downstream verification work in Steps 6 / 6.5 / 8 — for example, calling the product's API with curl to confirm that a session ID an agent reported actually resolves, or fetching session metadata to enrich the report. The orchestrator can read them with printenv (no need to store anywhere — the parent shell already has them since auto-discover sourced them from there).
This is asymmetric on purpose: the subagents see only generic API_KEY / PROJECT_ID / SECRET so the doc-quality test stays honest (they must read the docs to discover the real var name). The orchestrator is not being audited, so it can use the real names freely for verification.
Privacy guarantees the skill must uphold:
.env.If exec_mode = "Allow Bash", print a brief warning to chat before spawning: "Agents may run real shell commands (npm install, curl, pip install, git clone) on this machine. Make sure you're in a directory you're okay with agents modifying. Continue in 5 seconds or Ctrl-C to abort." — then continue.
Do not run sleep — just proceed after printing. The user reads the warning before the agents start working.
For each of N variants, produce a (persona, language, prompt) tuple. The prompt is one or two sentences, stating the abstract goal + language. No sub-checklist, no prescriptive steps.
Template:
{persona_prefix} {product}'s getting-started guide using {language}.{persona_tail} You've completed it when you've done whatever the guide treats as the primary successful outcome.
{persona_tail} is empty for most personas. The Skeptical persona uses it to inject its "note anything wrong" guidance as a separate sentence (with a leading space) so the prefix sentence stays grammatical. See references/prompt-variants.md for the full prefix/tail per persona.
Examples (using Acme as a placeholder — substitute the user-supplied product name):
The subagent is NOT told what the success outcome is — they have to read the docs to figure that out. That's the point: if the docs are good, they'll convey it clearly. If the docs are bad, the agent won't know when they're done, which IS a finding.
Read references/prompt-variants.md for the persona prefix library. Cross-product personas × languages, truncate to N. If cells < N, repeat with slight wording variation on the prefix.
Never paste doc content into the prompt.
Read references/subagent-brief.md — the full brief each subagent receives. It tells them:
WebFetch, Bash if allowed, Write)For each variant, invoke the Agent tool (subagent_type: general-purpose). Pass model: "opus" | "sonnet" | "haiku" per the user's choice. For Mixed, rotate models across the N slots deterministically (agent 1 → opus, agent 2 → sonnet, agent 3 → haiku, agent 4 → opus, …) and record the assigned model in the per-agent report row.
All N calls in one message so they run in parallel.
The subagent's prompt = the brief + their tiny task. The brief passes through exec_mode so the subagent knows whether Bash is available.
Wait for all N agents to return before continuing to Step 6. When agents are run in the background, completion notifications arrive one at a time and it is easy to lose count. Maintain a simple in-memory tally of returned-vs-spawned and, when the last agent reports back, print one explicit milestone line to chat: "All N agents returned — moving to trace parsing." Do not start Step 6 until that line has been printed. If the user asks "are the agents still running?" mid-flight, answer with the current <returned>/<spawned> count from your tally, not from re-counting prior chat output.
Verification of agent claims using orchestrator credentials. Before scoring, if Step 2.5 retained product-specific credentials, the orchestrator may use them to spot-check claims that subagents made (e.g. confirming a session ID with curl -H "X-BB-API-Key: $BROWSERBASE_API_KEY" https://api.browserbase.com/v1/sessions/<id>). Treat any unresolved IDs as evidence the agent may have hallucinated. Never include the credential header in the report — only the verification result (resolved / not resolved).
Each subagent returns two things in one response:
Retain both. Do not throw the prose away after extracting JSON. The prose is where you catch things the JSON self-report misses.
Extract JSON using: /```json\s*(\{[\s\S]*?\})\s*```\s*$/. Mark malformed/missing as errored with a raw_tail. If >50% errored, warn and offer retry.
Compute the top-line numbers from the JSON:
onboarding_status = "completed".docs_promise_met = true.Subagents don't have search — they guess URLs from training-data priors. Reports must show per WebFetch call where the URL came from, rendered as a small muted line directly under the tool input block in the trace. Do NOT put this at the top of the report as a general callout — it's only useful inline where the reader can correlate it to the specific call.
Classify each WebFetch URL into one of four provenance categories and render with the matching label + color:
TRAINING PRIOR (violet) — URL is a guess from training data (product name + common doc-site conventions like /introduction, /quickstart, /sdk/{lang}). Typical for the first 1–2 WebFetch calls.FROM LLMS.TXT (blue) — URL appears in the output of a prior llms.txt fetch in the same trace.FROM PREV PAGE (green) — URL was listed in the output of a previous WebFetch or Bash tool call in the same trace.GUESS · 404 (amber) — URL was guessed but 404'd — this is the most interesting category for doc-quality scoring (the URL should exist by convention but doesn't).Classification heuristic:
llms.txt WebFetch whose output mentioned this URL → FROM LLMS.TXTFROM PREV PAGEerror: true with 404 content → GUESS · 404TRAINING PRIORScore interpretation:
TRAINING PRIOR that succeed = product is well-represented in training data (head start).GUESS · 404 = URL taxonomy drifts from common conventions → real doc-discoverability finding.FROM LLMS.TXT appearing often after GUESS · 404 = llms.txt is carrying the docs' discoverability. Credit it explicitly in the findings.Before scoring, re-read the full prose from every agent. The JSON trace is the agent's self-report — an agent that hallucinated a wrong package name will also describe it correctly in its own trace. The truth lives in the tool output and the prose.
Scan for these patterns across the N transcripts:
Convergent mistakes. Did multiple agents try the same wrong thing? Wrong npm package name (e.g., exa vs exa-js), wrong endpoint, wrong env var, wrong import? If 3/3 agents used the wrong package, that's a doc quality disaster even if each "completed" the task. Agents don't invent identical wrong answers — shared training-data residue means the docs aren't overriding the model's wrong priors.
Hallucinated artifacts. Compare each agent's primary_outcome_achieved claim against what their tool output actually shows. If they claim "printed the title" but no title-fetching tool call appears in their Bash output, they're confabulating. Likely means the doc was unclear enough that the agent pattern-matched instead of reading.
Inconsistent outcomes. If 3 agents describe 3 different "successful" end-states, the docs don't clearly define success.
Silent workarounds. Did agents patch a bug (missing await, wrong env var name, undocumented required parameter) that a human copy-paster wouldn't have? Flag these — they're invisible DX taxes only captured in prose.
Tool-output vs. narrative contradictions. Sometimes an agent says "it worked" but the stderr from their Bash call says otherwise, and they failed to notice. Grep tool outputs in the prose for error, 404, 401, deprecated, warning.
Write a 3–5 sentence Narrative Review summary and include it prominently in the final report. This often surfaces the highest-value findings of the whole audit.
Read references/evaluation-rubric.md for full criteria. Score 0–100 based on aggregated evidence.
Onboarding success rate is the primary sanity check. See references/evaluation-rubric.md § 0 for the exact cap tiers — at <50% completion, every dimension is capped at 55 regardless of other evidence.
completed_subtasks / total tool_calls ratio, wasted calls.Weighted total → letter grade (90+ A, 75+ B, 60+ C, 45+ D, else F).
Produce:
Read assets/report-template.html. Fill placeholders:
{{TITLE}}, {{TARGET_REF}}, {{META}}, {{GRADE_LETTER}}, {{GRADE_CLASS}}, {{OVERALL_SCORE}}, {{AGENT_COUNT}}, {{COMPLETED_COUNT}}, {{PARTIAL_COUNT}}, {{STUCK_COUNT}}, {{BLOCKED_COUNT}}, {{ERRORED_COUNT}}, {{NARRATIVE_REVIEW_SECTION}} (see format below), {{EXEC_SUMMARY}}, {{WENT_WELL_ITEMS}}, {{DIDNT_GO_WELL_ITEMS}}, {{TIMELINE_SECTION}}, {{TOOL_BREAKDOWN_SECTION}}, {{METRICS_GRID}}, {{PATTERNS_SECTION}}, {{FIXES_LIST}}, {{AGENT_RESULTS_TABLE}} (at-a-glance summary table — see format below), {{AGENT_TRACES_SECTION}} (full collapsible per-agent trace cards — see format below).
Status counter mapping. The 5 status counters partition the agents exactly: every agent contributes to exactly one of {{COMPLETED_COUNT}} (onboarding_status=completed), {{PARTIAL_COUNT}} (partial), {{STUCK_COUNT}} (stuck), {{BLOCKED_COUNT}} (blocked-on-credentials), or {{ERRORED_COUNT}} (parser-failed traces). The five sub-counts must sum to {{AGENT_COUNT}}.
Section order in the rendered report (the template enforces this — do not reorder):
{{NARRATIVE_REVIEW_SECTION}}){{PATTERNS_SECTION}})Rationale: opinion before data. The reader needs the verdict (narrative + exec summary) and the actionable fix list before being asked to absorb metrics or timelines. Reference-y sections (timeline, tool breakdown) sit near the bottom for verification, not framing.
{{NARRATIVE_REVIEW_SECTION}} format. A <div class="narrative-review"> containing a <div class="label">Narrative Review</div> and a <div class="body">…</div> with the 3–5 sentence cross-agent summary from Step 6.5. This is the highest-value finding of the audit — keep prose tight, lead with the strongest observation. If Step 6.5 produced no notable cross-agent finding, render the section with a one-line body: No cross-agent patterns of note — agents converged on the docs' intended path with minor individual variation. Do not omit the section.
The 5 dimension scores are still computed (they feed the overall weighted score and letter grade), but do not render a per-dimension breakdown section — it adds visual weight without giving the reader anything actionable beyond what the narrative review and recommended fixes already cover. Keep dimension scoring internal.
{{AGENT_RESULTS_TABLE}} format. A <table class="agent-results-table"> rendered immediately above the per-agent cards. One row per agent with these columns (in order):
Standard · TypeScript. Use the values from the agent's JSON trace.model = Mixed (otherwise omit the column entirely; the single model is named in the header {{META}} line).<span class="status-pill {{status}}"> matching onboarding_status from the JSON (completed, partial, stuck, blocked-on-credentials). Map errored (parser-failed traces) to its own pill.count across the agent's tool_calls[] array. Right-aligned, monospace.wall_time_estimate_sec from the JSON, formatted as 92s (or 2m 14s if ≥120s). Right-aligned, monospace.Rationale: the cards below are detailed but require expanding each one. The table gives a one-screen comparison so the reader can spot outliers (the agent that took 3× as long, the one that fired 2× the tool calls) before drilling in.
{{AGENT_TRACES_SECTION}} format. One <details class="trace-card"> per agent. Each card's summary line MUST include the model used (e.g. <span class="chip">opus</span>) alongside persona/language chips. The card expands to show:
Event log (from detailed_trace) — rendered in compact Arena-style: monospace rows with color-coded bracketed labels, minimal chrome, no dots or timeline lines. Each row is one line of text; Input/Output blocks appear as indented <pre> blocks directly under their tool call (always visible, not click-to-expand — users want to scan the flow).
The FIRST event in every log is the prompt that was sent to that subagent, rendered with the gold [PROMPT] label at timestamp [setup]. The full prompt is behind a small click-to-expand button (the only collapsible in the stream — prompts are long and users don't always need them).
Visual structure:
[setup] [PROMPT] Task prompt sent to subagent [▸ Show full prompt]
[+0ms] [MILESTONE] agent_started
[+100ms] [THOUGHT] I'll start by discovering the docs.
[+1.2s] [TOOL_USE #1] WebFetch
Input: { "url": "...", "prompt": "..." }
[+3.4s] [TOOL_RESULT #1]
Output: # Example Product ...
[+4.5s] [TOOL_USE #2] Bash
Input: { "command": "npm install ...", "description": "..." }
[+9.2s] [TOOL_RESULT #2]
Output: added 12 packages ...
[+12s] [ERROR] install · PEP 668 blocked · recovered
[+45s] [RESULT ✓] Session created, task done.
CSS conventions (compact monospace, light background):
.trace-timeline — light gray background (#fafaf9), monospace font throughout, 0.78rem font-size, scrollable (max 640px)[time] (muted) + [LABEL] (colored, bold) + body content[PROMPT]: gold[MILESTONE]: blue[THOUGHT]: violet (body text also italic + muted)[TOOL USE]: orange (#ff6b35 or whatever brand accent the report uses)[TOOL RESULT]: green (or red if errored)[ERROR]: red (body also red)[RESULT ✓]: green (body green, bold).trace-io blocks with colored left-border (orange for input, green for output, red for errors). <pre> block shows the full tool input as JSON — never abbreviate. For WebFetch specifically, that means showing both the url AND the prompt args — the prompt is what the agent asked the page's content to be distilled to, and it's critical signal for understanding agent intent. If the input is large, truncate the value (not the structure) with … inside the relevant string.The main agent keeps each subagent's prompt. When spawning agents in Step 5, cache the full prompt text keyed by agent index so you can retrieve it for the report. Future-you (rendering) needs to look up what was sent to which agent.
The event log is the star of the show — this is what gives users the same "I can see exactly what the agent did and thought" experience as the Arena trace view. The prose summary is a narrative recap but the trace is the primary record.
Per-agent results table must include a Model column when model = Mixed, so cross-model comparison is visible at a glance. When a single model was used, mention it once in the header {{META}} line instead.
HTML-escape all user-supplied strings. Doc quotes go in <code> or <blockquote>.
All URLs must be clickable. When the report references:
/quickstart) → wrap as <a class="doc-link" href="{TARGET_BASE_URL}{path}" target="_blank" rel="noopener"><code>{path}</code></a> where {TARGET_BASE_URL} is the audit target's origin (e.g. https://docs.example.com)f0ec58cc) → link to the full resource URL (e.g. https://app.example.com/sessions/{full-id}) with a ↗ suffix indicating external link<a> not just <code>The CSS for these link classes:
a.doc-link { text-decoration: none; color: inherit; }
a.doc-link:hover code { background: #fff4ef; border-color: var(--brand); color: var(--brand); }
a.session-link { color: #166534; text-decoration: none; }
a.session-link:hover { text-decoration: underline; }
Rationale: a 404 finding is useless if the user can't click to verify. A session ID is useless if the user can't click through to the recording. Every URL-like string in the report should be one click away from verification.
Save to ./agent-experience-<slug>-<timestamp>.html (cwd). Slug = lowercase target basename with non-alphanumerics → -. Timestamp = YYYYMMDD-HHMMSS.
Print to chat:
Open via Bash: open <path> on macOS if exec_mode allowed it; otherwise just print the path.
If exec_mode = "Allow Bash" and you created per-agent subdirectories under ./dx-audit-tmp/ (or similar), delete that tree after the report is rendered:
rm -rf ./dx-audit-tmp/
Rationale: agents install node_modules, venvs, Go modules, etc. — often tens of MB per agent. Leaving them around pollutes the user's repo and wastes disk.
Exception: if a subagent's onboarding_status is stuck, or its trace was marked errored (JSON parse failed), leave that agent's subdir in place and note it in chat — the user may want to inspect the failing state. Delete only the completed / blocked-on-creds agents' dirs.
If exec_mode = "Draft-only", no cleanup is needed (no files were written outside the report).
references/evaluation-rubric.md — 5-dimension scoring rubric (Arena methodology).references/prompt-variants.md — Persona prefix library and core-task heuristics.references/subagent-brief.md — Verbatim brief + trace JSON schema.assets/report-template.html — HTML template with placeholders, stamped into the final report.exec_mode = Draft-only must disable Bash execution in the subagent brief.npx claudepluginhub browserbase/skills --plugin browseAudits agents and skills across the plugin ecosystem for quality, consistency, and correct invocation configuration. Use after creating or modifying multiple skills.
Runs evaluations on ADK agents: writing eval datasets, analyzing failures, comparing results, and optimizing agents using the Quality Flywheel methodology.
Audits Claude Code agents for violations, gaps, and improvements across 7 dimensions like description quality and frontmatter, outputting structured repair plans.