From agent-loops
Autonomous ML research loop that uses a temperature scheduler to force broad exploration with wild swings early, then adaptively picks swing/merge/exploit per iteration with stagnation guard. For open-ended ML campaigns requiring breadth before refinement.
Slash command
/agent-loops:exploratory-autoresearch
This loop runs hot. Like the standard ml-autoresearch, every experiment is followed by a diagnostic
analysis pass. Unlike it, the type of change at each iteration is set by a temperature
scheduler, not the agent's intuition: it forces wide, diverse swings early (full rewrites,
fundamentally different architectures and training regimes), then drops into an adaptive phase that
chooses between swing (a fresh wild approach), merge (combine two registered approaches), or
exploit (a focused tweak of the best). A stagnation guard bans exploit once it has run
<stagnation_limit> times in a row, forcing a pivot back to swing or merge so the loop never gets
stuck hill-climbing. The feedback signal is <metric> read from the run log; an approaches.md
registry and a move_type per iteration are what make the scheduler work.
You are the researcher. Do not pause to ask for permission once the loop is running.
Use for an open-ended ML campaign where you want forced breadth before refinement — the scheduler
guarantees you sample several distinct families before converging, and the stagnation guard prevents
endless small steps. Default to <swing_budget> = 3 and <stagnation_limit> = 3; raise <swing_budget>
for wider initial exploration. Not for the standard analysis-first ml-autoresearch (use that when you
want the analysis alone to drive each change, with no forced-swing scheduler), not for a single training
run or a fixed sweep, and not for tasks with no measurable scalar metric.
Resolve bindings interactively. If loop.run.yaml exists in the working dir, load it, confirm the
values in one line, and skip to the loop. Otherwise: on Claude Code (the AskUserQuestion tool is
available — record <host> = claude-code) infer a likely value for each binding from the project and
present it as the recommended option; on other hosts (<host> = other) ask each as a quoted plain-text
prompt. Then write loop.run.yaml (format: examples/run.example.yaml) and confirm every value with
the user before creating any other files. For the branches strategy, create the run branch with
git checkout -b autoresearch/<run_tag> (tag from today's date; the branch must not already exist). For time
gating, write <sandbox_root>/run_with_timeout.sh (timeout $(( <budget> * 60 )) <entrypoint> "$@") and
use it as the run command, hard-killing at 2 × <budget> min; for epochs, patch the epoch cap in an
<editable_files> file.
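The time gate described above can also be pictured in Python. This is a minimal sketch, not the skill's own script: `run_with_budget` is a hypothetical helper, and it assumes the experiment is launched as an argument list. It applies the same `<budget> * 60` second limit as the shell wrapper and sends all output to the log file.

```python
import subprocess
import sys
from pathlib import Path


def run_with_budget(cmd, budget_min, log_path):
    """Run cmd with stdout and stderr redirected to log_path (no tee).

    Returns "ok", "crash" (non-zero exit) or "timeout" (budget exceeded).
    Hypothetical helper for illustration; not part of the skill.
    """
    log_path = Path(log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("w") as log:
        try:
            proc = subprocess.run(
                cmd,
                stdout=log,
                stderr=subprocess.STDOUT,
                timeout=budget_min * 60,  # seconds, like $(( <budget> * 60 ))
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child process before re-raising.
            return "timeout"
    return "ok" if proc.returncode == 0 else "crash"


if __name__ == "__main__":
    status = run_with_budget([sys.executable, "-c", "print('hello')"], 1, "run.log")
    print(status)
```

In the loop's terms a "timeout" result would be logged as a crash, since an overrun is treated as one.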
| binding | meaning | default | how to infer |
|---|---|---|---|
| <metric> / <metric_direction> | scalar to optimize + minimize/maximize | — | scan editable files + README for metric names |
| <run_cmd> / <entrypoint> | command that runs one experiment end to end | — | pyproject.toml / .venv / README |
| <editable_files> | files fair game to edit (never the eval harness) | — | model / config / train scripts; exclude data, logs, env, harness |
| <sandbox_root> | where snapshots + ledgers live | ./sandbox | next to the editable files |
| <iter_strategy> | snapshots or branches | snapshots | is the working dir a clean git repo? |
| <gate> / <budget> | time (min) or epochs, and the limit | — | existing time/epoch settings in config |
| <swing_budget> | forced wild swings before adaptive mode | 3 (3–5) | wider = more initial breadth |
| <stagnation_limit> | max consecutive exploits before a forced pivot | 3 | — |
FILE EDIT GUARD: before touching any file at any point — setup or loop — confirm it is in
<editable_files>, because everything else is read-only ground truth (the eval harness defines
<metric>). No exceptions.
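The guard above amounts to a membership test on resolved paths. The sketch below is one way to express it, assuming `<editable_files>` is a list of paths relative to the project root; `is_editable` and the file names in the usage lines are hypothetical.

```python
from pathlib import Path


def is_editable(path, editable_files, root="."):
    """Return True only if path resolves to one of the listed editable files."""
    root = Path(root).resolve()
    allowed = {(root / f).resolve() for f in editable_files}
    # Resolving first means "./a/../train.py" and "train.py" compare equal,
    # and "../train.py" (outside the root) does not match.
    return (root / path).resolve() in allowed


if __name__ == "__main__":
    editable = ["train.py", "model.py"]  # made-up names for illustration
    print(is_editable("train.py", editable))
    print(is_editable("eval.py", editable))
```

Comparing resolved paths rather than raw strings is a design choice here: it keeps a differently spelled path from slipping past the check.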
Create the layout and write the ledger headers:
<sandbox_root>/
├── loop.run.yaml ← resolved bindings (written now)
├── results.tsv ← experiment ledger, header only (written now)
├── approaches.md ← registry of every distinct approach (header only, written now)
└── iter1/ ← created at loop start
results.tsv header (tab-separated; move_type ∈ {swing, merge, exploit}):
iter <metric> status move_type analysis_summary description
approaches.md header: # Approach Registry plus a one-line note that the merge step consults it to
find complementary approaches to combine.
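As a rough illustration of the setup step, the sketch below writes the two header-only files. `init_sandbox` is a hypothetical helper; it skips loop.run.yaml because that file's format lives in examples/run.example.yaml and is not reproduced here.

```python
from pathlib import Path

LEDGER_COLUMNS = ["iter", "<metric>", "status", "move_type", "analysis_summary", "description"]


def init_sandbox(sandbox_root, metric_name):
    """Create results.tsv and approaches.md containing only their headers."""
    root = Path(sandbox_root)
    root.mkdir(parents=True, exist_ok=True)
    # The <metric> column takes the resolved metric name, e.g. val_acc.
    columns = [metric_name if c == "<metric>" else c for c in LEDGER_COLUMNS]
    (root / "results.tsv").write_text("\t".join(columns) + "\n")
    (root / "approaches.md").write_text(
        "# Approach Registry\n"
        "The merge step consults this file to find complementary approaches to combine.\n"
    )
    return root


if __name__ == "__main__":
    print(init_sandbox("./sandbox", "val_acc"))
```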
Iteration 1 is always the unmodified baseline (it does not count as a swing): skip move-selection and
change-planning, but still run the mandatory analysis — it is the first empirical anchor iteration 2
builds on. Everything in <editable_files> is fair game (architecture, optimizer, hyperparameters, data
pipeline, loss, init, eval); on swings especially, full rewrites are encouraged. The only constraints are
that the code runs and finishes within <budget>. Epoch efficiency is part of the objective — a
change that reaches the same score in fewer effective steps is a real win. Simplicity criterion: all
else equal, simpler is better — a 0.001 gain that adds 20 lines of hacky code is not worth it; a 0.001
gain (or an equal metric) from deleting code is a keep.
The scheduler keeps two counters in memory across iterations: swings_taken (total swing iterations,
excludes the baseline) and consecutive_exploit (exploits since the last swing/merge; resets to 0 on
any swing or merge).
Copy this checklist each iteration and tick items off:
- branches: git log --oneline -5. snapshots: confirm iter<N>/ doesn't exist. Read iter N-1's analysis summary and the two counters.
- Choose move_type (iteration 1: SKIP — baseline). Apply the scheduler below, then record the move before touching any file.
- Plan the change and the <editable_files> it touches. See The three moves.
- snapshots: create iter<N>/{code_snapshot,analysis,results}/, copy every <editable_files> into code_snapshot/, copy loop.run.yaml to iter<N>/, then apply. branches: apply, then git commit -am "<move_type>: <desc>".
- Run with output redirected (never tee): <entrypoint> > <sandbox_root>/iter<N>/<run_log> 2>&1 (or run_with_timeout.sh when time-gated). If it overruns, kill it and treat as a crash.
- Read the result: grep '^<metric>:' <sandbox_root>/iter<N>/<run_log>. If empty, tail -n 50 <run_log>, read the trace, attempt one trivial fix (typo/import); if fundamentally broken, log crash and continue.
- Update approaches.md (swing and merge moves only). See The registry.
- Append to results.tsv (untracked — never commit). See Ledger.
- Better by <metric_direction> → keep, update current-best. Equal/worse/crash → discard/crash; branches git reset --hard HEAD~1, snapshots restore <editable_files> from iter<N>/code_snapshot/. Apply the simplicity criterion before logging discard. On a crash/OOM, fix with the minimal change that preserves the intent (OOM → smaller batch + grad-accum to hold effective batch) — never mutate the experiment.
Follow the rules exactly, in order — they are hard constraints, not suggestions:
IF iter == 1 → baseline (run unmodified; no move)
ELIF swings_taken < <swing_budget> → swing (forced exploration)
ELIF consecutive_exploit >= <stagnation_limit> → swing OR merge (forced pivot — exploit BANNED)
ELSE → agent chooses: swing / merge / exploit
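The rule order above, together with the two counters defined earlier (swings_taken and consecutive_exploit), can be sketched as two small functions. This is an illustrative reading of the rules, not the skill's implementation; the function names are hypothetical and the defaults follow the stated `<swing_budget>` = 3 and `<stagnation_limit>` = 3.

```python
def allowed_moves(iteration, swings_taken, consecutive_exploit,
                  swing_budget=3, stagnation_limit=3):
    """Return the moves the scheduler permits, checking the rules in order."""
    if iteration == 1:
        return ["baseline"]                      # run unmodified; no move
    if swings_taken < swing_budget:
        return ["swing"]                         # forced exploration
    if consecutive_exploit >= stagnation_limit:
        return ["swing", "merge"]                # forced pivot: exploit banned
    return ["swing", "merge", "exploit"]         # free choice, guided by analysis


def update_counters(move_type, swings_taken, consecutive_exploit):
    """Return the new (swings_taken, consecutive_exploit) after a move."""
    if move_type == "swing":
        return swings_taken + 1, 0
    if move_type == "merge":
        return swings_taken, 0                   # a merge resets the exploit streak only
    if move_type == "exploit":
        return swings_taken, consecutive_exploit + 1
    return swings_taken, consecutive_exploit     # baseline changes nothing


if __name__ == "__main__":
    swings, streak = 0, 0
    for it, move in enumerate(["baseline", "swing", "swing", "swing", "exploit"], start=1):
        assert move in allowed_moves(it, swings, streak)
        swings, streak = update_counters(move, swings, streak)
    print(swings, streak)
```

Returning the permitted set instead of a single move keeps the free branch open: the agent still picks from the set using the previous iteration's analysis.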
On the free ELSE branch, let iter N-1's analysis decide:
- merge: two approaches.md entries have distinct, non-overlapping strengths (prefer parents that changed different axes — they combine additively rather than interfere).
if move_type in {swing, merge}: swings_taken += 1 (swing only); consecutive_exploit = 0
elif move_type == exploit: consecutive_exploit += 1
- merge: pick two entries from approaches.md and name what is taken from each; the result is a new approach that is not a minor variant of either parent. Prefer components from different axes.
This is the spine that feeds the next move. Run whatever analysis most increases your understanding of
why this result happened. Every analysis script goes in iter<N>/analysis/; every output (plots, CSVs,
text) goes in iter<N>/results/, redirecting stdout there. Do not proceed until the results exist —
analysis that wrote no file did not happen. Dimensions to draw from (choose what fits): gradient
norms/flow, activation stats/saturation, embeddings (PCA/CKA/collapse), error & confusion analysis, loss
dynamics & headroom (was it still improving at cutoff?), weight/parameter stats, data profiling
(often the highest-yield), compute profiling.
Write a concise analysis summary (3–8 bullets): what you examined, the single most important finding, and what it implies for the next move (whether it favours swing / merge / exploit).
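The rule that analysis which wrote no file did not happen can be checked mechanically before moving on. The sketch below assumes the iter<N>/results/ layout described above; `analysis_outputs` is a hypothetical helper, and ignoring zero-byte files is an added assumption.

```python
from pathlib import Path


def analysis_outputs(iter_dir):
    """List the non-empty files under <iter_dir>/results/, sorted by path."""
    results = Path(iter_dir) / "results"
    if not results.is_dir():
        return []
    # Zero-byte files are ignored: an empty output carries no finding.
    return sorted(p for p in results.rglob("*") if p.is_file() and p.stat().st_size > 0)


if __name__ == "__main__":
    outputs = analysis_outputs("sandbox/iter2")
    print("analysis done" if outputs else "no analysis output yet")
```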
Ablation discipline (after any keep that touched more than one axis): you don't yet know which part
caused the gain — flag it and consider an ablation exploit next, reverting one component at a time.
Building on an unablated multi-axis change is building on an unknown.
Forward-looking instrumentation. After analysing, ask "what would I wish I had logged?" — if the
producing script is in <editable_files>, add it now (best-epoch checkpoint, per-layer grad norms,
per-class accuracy). Richer logs improve every future analysis; adding instrumentation is a valid
iteration on its own (log it as move_type=exploit).
The registry (approaches.md, swing and merge only)
Append an entry; the axes changed field is what the merge step uses to find complementary parents (two
that changed different axes beat two that both changed the architecture):
## Approach <N>: <short name>
- **move_type**: swing | merge
- **iter**: <N>
- **<metric>**: <value> (status: keep | discard | crash)
- **axes changed**: architecture | initialization | data pipeline | optimizer | lr schedule | evaluation | objective | other
- **key ideas**: <what makes this approach distinct>
- **strengths** / **weaknesses** (from analysis): <what worked / what the analysis says is missing>
- **ablated?**: yes (which components isolated, what was found) / no
- **parents** (merge only): Approach X + Approach Y — axes taken from each
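To show how the axes changed field could drive parent selection, here is a minimal sketch that lists pairs of registered approaches with no axis in common. The dict representation and `complementary_pairs` are assumptions for illustration, and the entries in the usage lines are made up rather than parsed from a real approaches.md.

```python
from itertools import combinations


def complementary_pairs(approaches):
    """Return name pairs whose 'axes' sets do not overlap."""
    pairs = []
    for a, b in combinations(approaches, 2):
        if not set(a["axes"]) & set(b["axes"]):
            pairs.append((a["name"], b["name"]))
    return pairs


if __name__ == "__main__":
    registry = [  # hypothetical entries
        {"name": "Approach 2", "axes": ["architecture"]},
        {"name": "Approach 3", "axes": ["architecture"]},
        {"name": "Approach 4", "axes": ["optimizer", "lr schedule"]},
    ]
    print(complementary_pairs(registry))
```

Disjoint axes only shortlist candidates. The strengths and weaknesses fields, which come from the analysis, still decide which pair is worth merging.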
Ledger
<sandbox_root>/results.tsv is tab-separated; never use commas in free text. Status ∈ {keep, discard,
crash}; use 0.000000 for the metric on crashes. For branches, the first column is the 7-char commit
instead of the iter number.
iter <metric> status move_type analysis_summary description
1 0.6320 keep swing baseline; grad norms clean, no pathologies baseline
2 0.5910 discard swing ResNet blocks; gradient flow good but overfit on 5 epochs ResNet-style residual blocks
3 0.6890 keep swing MLP-Mixer; feature mixing effective, less overfit MLP-Mixer token+channel mix
4 0.7120 keep merge Mixer channel-mix + ResNet skips; best of both merge: Mixer + ResNet skips
5 0.7250 keep exploit train/val gap small; warmup helped stability add 2-epoch LR warmup
6 0.7240 discard exploit no gain from dropout; val acc unchanged add dropout 0.1
7 0.6800 discard swing forced pivot after stagnation guard; ViT too data-hungry ViT patch-16 from scratch
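Reading the metric from the run log and formatting a ledger row can be sketched as below. Both helpers are hypothetical. Taking the last matching line, printing six decimals, and swapping commas for semicolons are assumptions chosen to fit the conventions stated above.

```python
import re


def read_metric(log_text, metric_name):
    """Return the last '<metric>: value' found at the start of a line, or None."""
    pattern = re.compile(rf"^{re.escape(metric_name)}:\s*(\S+)", re.MULTILINE)
    matches = pattern.findall(log_text)
    if not matches:
        return None
    try:
        return float(matches[-1])
    except ValueError:
        return None


def ledger_row(iteration, metric, status, move_type, analysis_summary, description):
    """Build one tab-separated results.tsv row."""
    def clean(text):
        # No commas in free text; tabs and newlines collapse to single spaces.
        return " ".join(str(text).replace(",", ";").split())

    value = 0.0 if status == "crash" or metric is None else metric
    return "\t".join([str(iteration), f"{value:.6f}", status, move_type,
                      clean(analysis_summary), clean(description)])


if __name__ == "__main__":
    log = "epoch 1\nval_acc: 0.61\nepoch 2\nval_acc: 0.6890\n"
    print(ledger_row(3, read_metric(log, "val_acc"), "keep", "swing",
                     "feature mixing effective", "MLP-Mixer token+channel mix"))
```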
approaches.md registry — one entry per swing/merge, format above. Example:
# Approach Registry
The merge step consults this file to find complementary approaches to combine.
## Approach 3: MLP-Mixer
- **move_type**: swing
- **iter**: 3
- **val_acc**: 0.6890 (status: keep)
- **axes changed**: architecture
- **key ideas**: token mixing + channel mixing, no convolutions
- **strengths**: less overfit than ResNet, fast per-epoch
- **weaknesses**: spatial locality not exploited; edges/textures underused
- **ablated?**: no
Report the best iteration (not necessarily the last) when summarising. Do not commit results.tsv or
approaches.md — leave them untracked.
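Picking the best iteration rather than the last can be sketched as a filter over kept rows. `best_row` and the dict layout are hypothetical; the usage lines reuse the metric values from the example ledger above, where higher is better.

```python
def best_row(rows, direction):
    """Return the kept row with the best metric, or None if nothing was kept."""
    kept = [r for r in rows if r["status"] == "keep"]
    if not kept:
        return None
    pick = max if direction == "maximize" else min
    return pick(kept, key=lambda r: r["metric"])


if __name__ == "__main__":
    rows = [
        {"iter": 1, "metric": 0.6320, "status": "keep"},
        {"iter": 2, "metric": 0.5910, "status": "discard"},
        {"iter": 3, "metric": 0.6890, "status": "keep"},
        {"iter": 4, "metric": 0.7120, "status": "keep"},
        {"iter": 5, "metric": 0.7250, "status": "keep"},
        {"iter": 6, "metric": 0.7240, "status": "discard"},
        {"iter": 7, "metric": 0.6800, "status": "discard"},
    ]
    print(best_row(rows, "maximize")["iter"])  # prints 5, not the last iteration
```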
- Edit only files in <editable_files> — confirm before every edit, because everything else is read-only ground truth and the eval harness defines <metric>.
- When consecutive_exploit >= <stagnation_limit> you MUST choose swing or merge — exploit is unavailable regardless of what the analysis suggests; this guard is what prevents endless hill-climbing.
- Every iteration must leave analysis output in iter<N>/results/; a move with no analysis behind it degrades the loop into blind iteration and starves the next decision.
- Redirect run output to <run_log>; never use tee. The sandbox is self-contained — no ../ escapes.
This loop runs forever until the human interrupts it — do not pause to ask "should I continue?" or
"is this a good stopping point?". The human may be away and expects autonomous work indefinitely. A
working result is the start of the next iteration, not the end. If ideas run dry: re-read the in-scope
files for missed angles, go deeper on the analysis (gradients/activations/embeddings/errors always
surface something), combine previous near-misses from approaches.md, or take a more radical swing.
npx claudepluginhub gaasher/agent-loop-skills --plugin agent-loops