From maestro
The discipline behind maestro:regression — prove a change didn't break existing work by running the suite, an adversarial bounded blast-radius probe against documented architecture contracts, and honest coverage, then emit a verdict taskmanager's done-gate enforces. Trigger: a change is about to be marked done / committed and you must confirm it didn't regress existing behavior or violate a documented boundary.
How this skill is triggered — by the user, by Claude, or both
Slash command
/maestro:regressionThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
The regression gate answers one question: **did this change break anything that already worked?** It
The regression gate answers one question: did this change break anything that already worked? It
is the suite's "new work never breaks existing work" guarantee. It is a maestro capability — the
brain reused, never reinvented: the hard part (the blast-radius reasoning) is deep-analysis
scoped to a diff, hardened by adversarial-verify.
Per adversarial-verify (referenced, never overstated): refute-first is always on; multiple lenses
are for what matters. Simple, reversible, low-risk changes get a single refute-first pass —
verifying them from many angles "turns into ritual" (adversarial-verify/SKILL.md line 33). A
high-blast-radius or irreversible change (touches a documented boundary, a data model, a public
contract, or many call sites) earns multiple lenses (that skill's "Step 3 — if it matters", line 65).
Rigor never weakens; the number of lenses scales with the stakes.
Run the existing suite. Discover and run the project's tests; any failure is a regression. No
discoverable test command → record it honestly as unverified-risk, never a fake pass.
Run the CURRENT source, not a stale build cache. A regression masked by stale bytecode/objects
is a silent false-GREEN — the gate's single worst failure (it ships a regression as pass). For
Python, ensure fresh bytecode (clear __pycache__, or python -B / PYTHONDONTWRITEBYTECODE, or
a clean checkout in CI); the same guard applies to any compiled/bundled/transpiled cache.
(Validated end-to-end: a same-second, same-byte-size source edit can otherwise reuse a stale
.pyc and hide the change.) The characterization net (below) inherits this — its compare rides
this suite-run.
A failure is not automatically your diff, either. When the full suite fails but your
targeted tests pass, bisect by file and rule out interpreter/resource limits — memory
exhaustion/OOM, time limits, a shared global — before concluding your change caused it. Don't
blame your diff for a hit memory ceiling.
Bounded blast-radius probe (the brain). Refute-first, over the diff + the documented
docs/architecture/ contracts — bounded to this one context (no thoughts/ substrate, no
whole-repo read). Hunt, assuming the change breaks something:
docs/architecture/?path:line, a grep, a
contract line) — never "looks fine".
Independence — when it matters. Probing your OWN change in your OWN context is self-review,
which the adversarial-verify standard rates weaker than fresh eyes. For a high-blast-radius
or irreversible change (touches a documented boundary / data model / public contract, or many
call sites), dispatch this probe as an independent agent (Task tool, refute-first, over the
diff + docs/architecture/ contracts) instead of self-probing — fresh independent eyes per the
standard. A local, reversible change may probe in-context (single pass is right-sized there).Honest coverage. Name every changed area with no covering test as unverified-risk. The confidence you report is bounded by what you actually verified; never imply certainty over untested paths. When a changed-untested area is also high-blast-radius (per part 2) and is about to change, flag it as a characterization candidate (see Characterization below) — the case for pinning its current behavior before the change. Low-risk untested code stays a plain advisory.
Verdict. Emit the structured JSON (see commands/regression.md). fail if the suite failed or
a contract is violated; else pass. Never emit overridden (a logged human action).
The characterization net — the regression-gate capability the spec deferred (its "part 4"), now built. The discipline for safely changing risky untested code. It adds no schema, no store, no comparator, no command: a golden master is just a normal test in the project's own suite, and the existing part-1 suite-run is the comparator.
When it fires (advisory-first). Only on the intersection changed ∩ untested ∩
high-blast-radius: a path:symbol that is in the diff, appears in part 3's changed_untested,
and part 2 rated high-blast-radius (touches a documented contract / data model / public surface
/ many call sites). Attach the high-risk flag to the specific untested symbol by intersecting part
2's cited evidence (path:line/grep) with the changed-untested location — do not assume a shared
key. The gate flags such a symbol as a characterization candidate; capturing is a deliberate
act (a human/agent decides), not automatic. Low-risk untested code stays a plain advisory — that
restraint is the right-sizing. Cap captures at the top N by blast-radius (dependent count /
contract proximity); report the rest as honest residual unverified-risk — never spawn hundreds.
Capture (maestro-only), against UNCHANGED code:
git stash the working change, or
capture against git show HEAD:<path> / a clean checkout of the task's base commit, then restore.
(Preferred: capture during the implement phase, before editing; fallback: this base-commit
capture at done-time.).snap, *.approved.txt, testdata/*.golden,
pytest-regressions) when present, else literal asserts. No char-net-specific format or serializer.TZ /
LC_ALL); mask timestamps/UUIDs/ordering in the assertion. A value you cannot make deterministic
is NOT pinned — report it as residual unverified-risk, never flaky-pinned (a flaky golden
cries wolf and gets overridden into irrelevance).run never commits; the user's commit discipline does.Storage + label. A characterization test is an ordinary committed test file in the project's
native test tree, grouped under a characterization/ subfolder by convention (a folder, not
infrastructure). Mark it with a header comment — never inside a snapshot payload:
CHARACTERIZATION — pins current behavior of <path:symbol> as-of <commit>; NOT a correctness claim; do not edit assertions to match a change without re-blessing.
Compare = the suite-run (no separate comparator). After the change, part 1 re-executes the
committed characterization test. GREEN = behavior preserved. RED = the change altered pinned
behavior → it is a suite failure → status='fail' → regression_checks → v_task_regression
→ run §8a routes to needs-review. Zero new compare code, zero new routing.
Recording (no schema change). A golden mismatch rides suite_json (it is a suite failure).
The existence of a pin rides coverage_json (a free-form TEXT field) — mark the captured
path:symbol as characterized with its test/fixture path. Never overload the risk enum with
a golden value — risk is a contract-violation axis (violated|at-risk|clear); a behavior-pin is
a coverage concept. No table, column, view, CHECK, or schema-version change.
Coverage-loop (so it doesn't re-trigger forever). Once a characterization test exists, count
"has a covering characterization test in the suite" as covered for the captured paths — those
drop out of changed_untested. Uncaptured paths of the same symbol stay unverified-risk and
part 3 keeps reporting them (no false all-clear; not a binary "now tested").
On RED, classify — re-bless, do not override. A failing characterization test is just normal verdict reasoning applied to one more failing test:
fail → needs-review (the captured-vs-current diff
is the evidence).verdict_reasoning + a capture.log breadcrumb. Re-bless is the
correct tool (it updates the invariant to the new intended behavior); a logged override is
the wrong tool here — it would ship a behavior change against a now-stale master.Grow, never shrink (bug → permanent golden case). Every bug later debugged becomes a permanent characterization case — reusing scribe's incident template, which already names "a regression/characterization test" as the prevention artifact. The bug's reproducing input + correct post-fix output is committed and joins the suite forever; the same defect can never silently return. Cases are only added, or re-blessed on an explicit intentional change — removing one is a visible, reviewable git deletion at the same bar as deleting any test. "Improved, never reduced," with no new enforcement.
Degradation. Preservation is universal — a committed characterization test is a plain test, so the degraded suite-only check (and any CI) runs it even with maestro absent. Capture is maestro-only — like the blast-radius probe, it is part of the brain; with maestro absent the net can verify an existing master but cannot create a new one (report that honestly). No discoverable test runner → "cannot pin" reported as unverified-risk, never a false all-clear.
Sizing (per the standard above). One refute-first capture pass for a local seam; escalate to more input-lenses only when the seam is high-blast-radius — re-running N cases is one lens run N times, not N refutation angles. No ritual.
adversarial-verify or deep-analysis — it references and embodies them.deep-analysis. This is bounded to one change.v_task_regression
and blocks done on anything but pass/overridden). maestro's verify_before_commit hook
surfaces it as a loud advisory at commit time. When maestro is absent, taskmanager runs a degraded
suite-only check (the mechanical part needs no brain; the blast-radius probe is maestro-only).A change is safe only because an independent, refute-first pass tried to show it broke something — the suite, a dependent, a documented contract — and failed. Size the rigor to the blast radius; report honestly what you could not verify; emit the verdict and let the gate enforce it.
npx claudepluginhub mwguerra/plugins --plugin maestroCreates bite-sized, testable implementation plans from specs or requirements, with file structure and task decomposition. Activates before coding multi-step tasks.