From agent-loops
Sizes and preregisters a two-arm comparison (A/B test, RCT, behavioral study, model evaluation) by computing required sample size via Monte Carlo simulation, auditing design validity, and producing a preregistration. Only for single two-arm comparisons with one continuous or binary primary outcome.
How this skill is triggered — by the user, by Claude, or both
Slash command
/agent-loops:power-analysisThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
A **power-analysis-and-preregister** loop for a **two-arm comparison**. The artifact is the study's
A power-analysis-and-preregister loop for a two-arm comparison. The artifact is the study's statistical plan; the feedback signal is two parts — statistical power (estimated by Monte-Carlo simulation of the planned test) and a count of validity flaws. Each iteration simulates power, solves for the sample size that reaches the target, audits the design for flaws, and revises — until power clears the target and the flaw list is empty. The deliverable is a sample-size justification plus a preregistration that pins the hypothesis, primary outcome, analysis, sample size, and stopping rule before any data is seen.
This loop does exactly three things, in a loop: (1) computes power and required sample size for a
two-group comparison by simulation, (2) runs a fixed validity checklist over the design,
and (3) writes a preregistration. The vendored power model (tools/power_sim.py) covers
two-sample mean (continuous outcome) and two-proportion (binary outcome) tests only.
It is not a general experiment designer. It does not handle factorial, repeated-measures, clustered/multilevel, time-series, adaptive, or survival designs; it does not pick your outcome measure or manipulation from domain knowledge; and it does not analyze data you have already collected. For those, the power numbers here do not apply — use a design-appropriate power method. If the study is not a simple two-arm comparison, say so and stop rather than reporting a power that does not match the planned analysis.
Use this to size and preregister one two-arm comparison whose primary outcome is a continuous mean or a binary rate. Default to powering for the minimal effect of interest the user states; if they are unsure of that effect, help them set it from a baseline and a smallest-meaningful difference rather than an optimistic guess — a design "powered" for an effect bigger than reality is a fiction. If the study is not a two-arm comparison, stop and point to a design-appropriate method.
Resolve bindings interactively. If loop.run.yaml exists in the working dir, load it, confirm the
values in one line, and skip to the loop. Otherwise: on Claude Code (the AskUserQuestion tool is
available) infer a likely value for each binding and present it as the recommended option; on other
hosts ask each as a quoted plain-text prompt. Then write loop.run.yaml (format:
examples/run.example.yaml) and confirm the values before creating any other files.
| binding | meaning | default | how to infer |
|---|---|---|---|
<hypothesis> | the claim the experiment tests | — | ask the user |
<outcome> | primary outcome type + minimal effect of interest: continuous (baseline_mean, sd, min_effect) or binary (baseline_rate, min_lift) | — | ask; this fixes the effect size power is computed at |
<target_power> | power the design must clear | 0.80 | — |
<alpha> | significance level | 0.05 | — |
<power_cmd> | invocation of the vendored simulator | python3 <skill_dir>/tools/power_sim.py --design <two-sample-mean|two-proportion> --effect <e> [--sd <sd> | --baseline <p0>] --alpha <alpha> --n <n_per_group> | — |
<design_doc> | output design + preregistration file | <sandbox_root>/design.md | — |
<sandbox_root> | where design + ledger live | ./sandbox | — |
<budget> | max iterations | 8 | — |
<power_cmd> prints one JSON object, {"power", "n_per_group", ...}. Run it to get the power; never
estimate power by hand.
Copy this checklist and tick items off:
<design_doc>; record nothing as final.<power_cmd> at the current n and the assumed effect.< <target_power>, re-run at larger n (step up, then bisect) until it clears.n to the power-adequate value, update <design_doc> (+ Preregistration section).<budget>.Iteration 0 — draft. Write a first design to <design_doc>: the arms/conditions, the unit of
analysis and how units are assigned, the primary outcome and the exact planned test, the assumed effect
size (from <outcome>), and a first sample-size guess. Record nothing as final yet.
Then, until stop (power met + no flaws, or budget):
<power_cmd> at the current per-group n and the assumed effect, with the
--design matching the planned test. Record the achieved power.power < <target_power>, re-run the simulation at larger n — step up (e.g. double),
then bisect — until power clears the target, and adopt that n.n to
the power-adequate value. Update <design_doc>, including a Preregistration section: hypothesis,
primary outcome, the one planned analysis, sample size + how it was derived, randomization scheme, and
the stopping rule.Stop when power ≥ <target_power> and the flaw list is empty, or at <budget>. Report the
final design + preregistration, the achieved power and required n, and — if stopping on budget — the
flaws still outstanding.
<sandbox_root>/ledger.tsv, tab-separated, never commas in the text. Header:
iter n_per_group power open_flaws change
Example:
iter n_per_group power open_flaws change
0 50 0.50 2 draft: volunteers vs last-year cohort, n=50
1 100 0.80 1 solved n for 80% power at d=0.4
2 100 0.80 0 randomized concurrent control; pre-specified single primary outcome + stopping rule
Report the best iteration: the final design, the achieved power and required n, and any flaws
still open if stopping on budget.
--design in the simulation
must match the test named in the design. Do not edit tools/power_sim.py.../ escapes. Do not pause the loop to ask whether to continue.npx claudepluginhub gaasher/agent-loop-skills --plugin agent-loopsProvides Python code patterns for reproducible experiments: random seeds, environment logging, train/test splits, cross-validation, A/B testing, and power analysis. For ML/statistical designs.
Guides users through randomized experiments (RCTs/A/B tests) with power analysis, balance checks, and robust standard errors in R or Python.
Designs statistically rigorous A/B tests and interprets experiment results with ship/iterate/kill recommendations. Calculates sample size, run time, and flags design risks.