Skill

regression

The discipline behind maestro:regression — prove a change didn't break existing work by running the suite, an adversarial bounded blast-radius probe against documented architecture contracts, and honest coverage, then emit a verdict taskmanager's done-gate enforces. Trigger: a change is about to be marked done / committed and you must confirm it didn't regress existing behavior or violate a documented boundary.

Popularity

Parent stars

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/maestro:regression

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

The regression gate answers one question: **did this change break anything that already worked?** It

SKILL.md

174 lines · ~3.1k tokens

Stats

LanguageShell

Parent stars1

MaintenanceExcellent

Last CommitJun 22, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

Regression — prove the change didn't break existing work

The regression gate answers one question: did this change break anything that already worked? It is the suite's "new work never breaks existing work" guarantee. It is a maestro capability — the brain reused, never reinvented: the hard part (the blast-radius reasoning) is deep-analysis scoped to a diff, hardened by adversarial-verify.

The standard: elastic ceremony, invariant rigor

Per adversarial-verify (referenced, never overstated): refute-first is always on; multiple lenses are for what matters. Simple, reversible, low-risk changes get a single refute-first pass — verifying them from many angles "turns into ritual" (adversarial-verify/SKILL.md line 33). A high-blast-radius or irreversible change (touches a documented boundary, a data model, a public contract, or many call sites) earns multiple lenses (that skill's "Step 3 — if it matters", line 65). Rigor never weakens; the number of lenses scales with the stakes.

The four parts

Run the existing suite. Discover and run the project's tests; any failure is a regression. No discoverable test command → record it honestly as unverified-risk, never a fake pass. Run the CURRENT source, not a stale build cache. A regression masked by stale bytecode/objects is a silent false-GREEN — the gate's single worst failure (it ships a regression as pass). For Python, ensure fresh bytecode (clear __pycache__, or python -B / PYTHONDONTWRITEBYTECODE, or a clean checkout in CI); the same guard applies to any compiled/bundled/transpiled cache. (Validated end-to-end: a same-second, same-byte-size source edit can otherwise reuse a stale .pyc and hide the change.) The characterization net (below) inherits this — its compare rides this suite-run. A failure is not automatically your diff, either. When the full suite fails but your targeted tests pass, bisect by file and rule out interpreter/resource limits — memory exhaustion/OOM, time limits, a shared global — before concluding your change caused it. Don't blame your diff for a hit memory ceiling.
Bounded blast-radius probe (the brain). Refute-first, over the diff + the documented docs/architecture/ contracts — bounded to this one context (no thoughts/ substrate, no whole-repo read). Hunt, assuming the change breaks something:
- Dependents: what reads/calls the changed symbol, table, or endpoint? (grep the call sites.) Does the change alter a contract they rely on?
- Documented boundaries: does the change make a module reach across a documented seam, write data another boundary owns, or diverge from an interface contract in docs/architecture/?
- Irreversibles: does it change a data model, a public contract, or a migration in a way that is expensive to undo? Size the lenses to the blast radius (above). Cite specific evidence (path:line, a grep, a contract line) — never "looks fine". Independence — when it matters. Probing your OWN change in your OWN context is self-review, which the adversarial-verify standard rates weaker than fresh eyes. For a high-blast-radius or irreversible change (touches a documented boundary / data model / public contract, or many call sites), dispatch this probe as an independent agent (Task tool, refute-first, over the diff + docs/architecture/ contracts) instead of self-probing — fresh independent eyes per the standard. A local, reversible change may probe in-context (single pass is right-sized there).
Honest coverage. Name every changed area with no covering test as unverified-risk. The confidence you report is bounded by what you actually verified; never imply certainty over untested paths. When a changed-untested area is also high-blast-radius (per part 2) and is about to change, flag it as a characterization candidate (see Characterization below) — the case for pinning its current behavior before the change. Low-risk untested code stays a plain advisory.
Verdict. Emit the structured JSON (see commands/regression.md). fail if the suite failed or a contract is violated; else pass. Never emit overridden (a logged human action).

Characterization — pinning risky untested code before you change it

The characterization net — the regression-gate capability the spec deferred (its "part 4"), now built. The discipline for safely changing risky untested code. It adds no schema, no store, no comparator, no command: a golden master is just a normal test in the project's own suite, and the existing part-1 suite-run is the comparator.

When it fires (advisory-first). Only on the intersection changed ∩ untested ∩ high-blast-radius: a path:symbol that is in the diff, appears in part 3's changed_untested, and part 2 rated high-blast-radius (touches a documented contract / data model / public surface / many call sites). Attach the high-risk flag to the specific untested symbol by intersecting part 2's cited evidence (path:line/grep) with the changed-untested location — do not assume a shared key. The gate flags such a symbol as a characterization candidate; capturing is a deliberate act (a human/agent decides), not automatic. Low-risk untested code stays a plain advisory — that restraint is the right-sizing. Cap captures at the top N by blast-radius (dependent count / contract proximity); report the rest as honest residual unverified-risk — never spawn hundreds.

Capture (maestro-only), against UNCHANGED code:

Identify the seam — the smallest observable boundary a test can pin (function return, CLI stdout+exit, HTTP status+body, serialized output).
Capture against the pre-change tree — the load-bearing mechanic that makes it a characterization test, not a tautology. The gate runs at done-time when the change is already made, so record the baseline from the unchanged code: git stash the working change, or capture against git show HEAD:<path> / a clean checkout of the task's base commit, then restore. (Preferred: capture during the implement phase, before editing; fallback: this base-commit capture at done-time.)
Enumerate a small, deliberately diverse input set — happy path + boundary + empty/null + error path, by judgment. A master that pins only the happy path is a false all-clear, so edge/error paths are required. Not exhaustive, not generated (no fuzzer).
Record observed outputs AS the assertions (current behavior, quirks and all) — using the stack's native snapshot/approval idiom (.snap, *.approved.txt, testdata/*.golden, pytest-regressions) when present, else literal asserts. No char-net-specific format or serializer.
Handle nondeterminism honestly — pin entropy at the source where cheap (fixed seed / TZ / LC_ALL); mask timestamps/UUIDs/ordering in the assertion. A value you cannot make deterministic is NOT pinned — report it as residual unverified-risk, never flaky-pinned (a flaky golden cries wolf and gets overridden into irrelevance).
Prove GREEN on the unchanged code (a master that fails on current code captured nothing — fail-closed), then stage the test+fixture into the task's normal commit. The skill does not own a commit — run never commits; the user's commit discipline does.

Storage + label. A characterization test is an ordinary committed test file in the project's native test tree, grouped under a characterization/ subfolder by convention (a folder, not infrastructure). Mark it with a header comment — never inside a snapshot payload: CHARACTERIZATION — pins current behavior of <path:symbol> as-of <commit>; NOT a correctness claim; do not edit assertions to match a change without re-blessing.

Compare = the suite-run (no separate comparator). After the change, part 1 re-executes the committed characterization test. GREEN = behavior preserved. RED = the change altered pinned behavior → it is a suite failure → status='fail' → regression_checks → v_task_regression → run §8a routes to needs-review. Zero new compare code, zero new routing.

Recording (no schema change). A golden mismatch rides suite_json (it is a suite failure). The existence of a pin rides coverage_json (a free-form TEXT field) — mark the captured path:symbol as characterized with its test/fixture path. Never overload the risk enum with a golden value — risk is a contract-violation axis (violated|at-risk|clear); a behavior-pin is a coverage concept. No table, column, view, CHECK, or schema-version change.

Coverage-loop (so it doesn't re-trigger forever). Once a characterization test exists, count "has a covering characterization test in the suite" as covered for the captured paths — those drop out of changed_untested. Uncaptured paths of the same symbol stay unverified-risk and part 3 keeps reporting them (no false all-clear; not a binary "now tested").

On RED, classify — re-bless, do not override. A failing characterization test is just normal verdict reasoning applied to one more failing test:

Unintended diff → a genuine regression → fail → needs-review (the captured-vs-current diff is the evidence).
Intended diff → re-bless: re-record the assertions to the new behavior in the same commit, with the reason in verdict_reasoning + a capture.log breadcrumb. Re-bless is the correct tool (it updates the invariant to the new intended behavior); a logged override is the wrong tool here — it would ship a behavior change against a now-stale master.

Grow, never shrink (bug → permanent golden case). Every bug later debugged becomes a permanent characterization case — reusing scribe's incident template, which already names "a regression/characterization test" as the prevention artifact. The bug's reproducing input + correct post-fix output is committed and joins the suite forever; the same defect can never silently return. Cases are only added, or re-blessed on an explicit intentional change — removing one is a visible, reviewable git deletion at the same bar as deleting any test. "Improved, never reduced," with no new enforcement.

Degradation. Preservation is universal — a committed characterization test is a plain test, so the degraded suite-only check (and any CI) runs it even with maestro absent. Capture is maestro-only — like the blast-radius probe, it is part of the brain; with maestro absent the net can verify an existing master but cannot create a new one (report that honestly). No discoverable test runner → "cannot pin" reported as unverified-risk, never a false all-clear.

Sizing (per the standard above). One refute-first capture pass for a local seam; escalate to more input-lenses only when the seam is high-blast-radius — re-running N cases is one lens run N times, not N refutation angles. No ritual.

What it is NOT

Not a reinvention of adversarial-verify or deep-analysis — it references and embodies them.
Not a whole-repo audit — that is deep-analysis. This is bounded to one change.
Not the enforcer — taskmanager's done-gate is the one true block (it reads v_task_regression and blocks done on anything but pass/overridden). maestro's verify_before_commit hook surfaces it as a loud advisory at commit time. When maestro is absent, taskmanager runs a degraded suite-only check (the mechanical part needs no brain; the blast-radius probe is maestro-only).

The mantra

A change is safe only because an independent, refute-first pass tried to show it broke something — the suite, a dependent, a documented contract — and failed. Size the rigor to the blast radius; report honestly what you could not verify; emit the verdict and let the gate enforce it.

regression

Popularity

Invocation

Context Preview

SKILL.md

regression

Popularity

Invocation

Context Preview

SKILL.md

Regression — prove the change didn't break existing work

The standard: elastic ceremony, invariant rigor

The four parts

Characterization — pinning risky untested code before you change it

What it is NOT

The mantra

Similar Skills

Regression — prove the change didn't break existing work

The standard: elastic ceremony, invariant rigor

The four parts

Characterization — pinning risky untested code before you change it

What it is NOT

The mantra

Similar Skills