Plans verification and validation campaigns for simulation codes using manufactured solutions (MMS), benchmark problems, grid/time refinement, uncertainty propagation, and pass/fail criteria. Use to prove solver, model, or result trustworthiness.
Slash command: /full:benchmark-and-mms-planner
Design a verification and validation plan before trusting simulation results. The skill helps agents choose manufactured solutions, benchmark cases, refinement protocols, uncertainty checks, and pass/fail criteria.
| Input | Description | Example |
|---|---|---|
| PDE or model class | Governing family | diffusion, elasticity, phase-field |
| Quantity of interest | Metric to validate | interface velocity, L2 temperature error |
| Dimension | 1, 2, or 3 | 2 |
| Expected order | Formal discretization order | 2 |
| Reference availability | Analytic, benchmark, or none | analytic |
| Risk level | Cost or consequence of wrong result | high |
scripts/benchmark_mms_planner.py emits inputs and results with:
- verification_strategy
- effective_model: the resolved model family actually used; unknown families fall back to general.
- mms_plan
- benchmark_cases
- refinement_protocol (dimension, levels, spacing_ratio, expected_order, accept_observed_order_min, include_time_refinement)
- uncertainty_plan (propagate_inputs, report_error_bars, separate_discretization_and_model_error): propagation/error-bar guidance driven by risk level and reference type.
- acceptance_criteria
- warnings

The accept_observed_order_min is an engineering screening heuristic, not a certified bound: it is the formal expected_order reduced by a fractional tolerance (10% for high risk, 20% otherwise) and floored at first-order convergence (1.0). The relative band keeps strictness consistent across formal orders. See references/vv_patterns.md.
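The screening threshold described above can be written down in a few lines. This is a minimal sketch of one reading of that description, not the planner's actual source: the function name screening_order_min is hypothetical, and "reduced by a fractional tolerance" is assumed to mean multiplication by (1 - tolerance).

```python
def screening_order_min(expected_order: float, risk: str) -> float:
    """Illustrative version of the accept_observed_order_min heuristic."""
    # Assumption from the text: 10% relative band for high risk, 20% otherwise.
    tolerance = 0.10 if risk == "high" else 0.20
    # Shrink the formal order by the relative band, then floor at first order.
    return max(1.0, expected_order * (1.0 - tolerance))


if __name__ == "__main__":
    for order, risk in [(2.0, "high"), (2.0, "medium"), (1.0, "low"), (4.0, "high")]:
        print(order, risk, round(screening_order_min(order, risk), 3))
```

Under this reading, a formally second-order scheme would need an observed order of about 1.8 at high risk and about 1.6 otherwise, while a first-order scheme sits at the floor of 1.0.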
Run benchmark_mms_planner.py --json:

    python3 skills/verification-validation/benchmark-and-mms-planner/scripts/benchmark_mms_planner.py \
      --model diffusion \
      --quantity "L2 error in temperature" \
      --dimension 2 \
      --expected-order 2 \
      --reference analytic \
      --risk high \
      --json
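One way to consume the JSON output from a script is sketched below. The payload layout is an assumption based on the field names this document mentions: a top-level inputs block echoing the arguments and a results block containing effective_model. The helper names run_planner, check_echo, and unintended_fallback are hypothetical.

```python
import json
import subprocess

SCRIPT = "skills/verification-validation/benchmark-and-mms-planner/scripts/benchmark_mms_planner.py"


def run_planner(args):
    """Run the planner with --json and return the parsed payload."""
    proc = subprocess.run(
        ["python3", SCRIPT, *args, "--json"],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        # The document states that rejected inputs exit with code 2.
        raise RuntimeError(f"planner exited with {proc.returncode}: {proc.stderr.strip()}")
    return json.loads(proc.stdout)


def check_echo(payload, expected):
    """Return {key: (requested, echoed)} for every mismatch in the inputs block."""
    echoed = payload.get("inputs", {})  # assumed key name
    return {k: (v, echoed.get(k)) for k, v in expected.items() if echoed.get(k) != v}


def unintended_fallback(payload, requested_model):
    """True when the planner resolved to general but general was not requested."""
    effective = payload.get("results", {}).get("effective_model")  # assumed location
    return effective == "general" and requested_model != "general"


if __name__ == "__main__":
    payload = run_planner([
        "--model", "diffusion", "--quantity", "L2 error in temperature",
        "--dimension", "2", "--expected-order", "2",
        "--reference", "analytic", "--risk", "high",
    ])
    print(check_echo(payload, {"dimension": 2, "reference": "analytic", "risk": "high"}))
    print(unintended_fallback(payload, "diffusion"))
```

Saving the parsed payload next to the simulation results keeps the plan and the evidence for it together.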
This skill plans verification work; it does not run the solver or prove that a physical model is appropriate for an experiment.
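Because the planner does not run the solver, the refinement study it prescribes has to be executed separately. The following is a small, self-contained sketch of what such a study can look like for a toy problem: a 1D steady diffusion equation, -u'' = f on [0, 1], with a manufactured solution. The manufactured solution, the finite-difference solver, and all function names are illustrative assumptions, not part of the skill.

```python
import math


def u_exact(x):
    # Manufactured solution, chosen here only for illustration.
    return math.sin(math.pi * x) + x


def source(x):
    # Forcing f = -u'' for the manufactured solution; the linear term vanishes.
    return math.pi ** 2 * math.sin(math.pi * x)


def solve_and_measure(n):
    """Solve -u'' = f with n intervals; return (h, L2 error, Linf error)."""
    h = 1.0 / n
    m = n - 1  # number of interior unknowns
    rhs = [h * h * source((i + 1) * h) for i in range(m)]
    rhs[0] += u_exact(0.0)   # Dirichlet data taken from the manufactured solution
    rhs[-1] += u_exact(1.0)
    # Thomas algorithm for the tridiagonal system with stencil (-1, 2, -1).
    c = [0.0] * m
    d = [0.0] * m
    c[0], d[0] = -0.5, rhs[0] / 2.0
    for i in range(1, m):
        denom = 2.0 + c[i - 1]
        c[i] = -1.0 / denom
        d[i] = (rhs[i] + d[i - 1]) / denom
    u = [0.0] * m
    u[-1] = d[-1]
    for i in range(m - 2, -1, -1):
        u[i] = d[i] - c[i] * u[i + 1]
    errs = [abs(u[i] - u_exact((i + 1) * h)) for i in range(m)]
    l2 = math.sqrt(h * sum(e * e for e in errs))
    return h, l2, max(errs)


def observed_orders(errors, ratio=2.0):
    """Observed order between successive levels: log(E_coarse / E_fine) / log(ratio)."""
    return [math.log(a / b) / math.log(ratio) for a, b in zip(errors, errors[1:])]


if __name__ == "__main__":
    levels = [16, 32, 64, 128]  # four levels at spacing ratio 2
    results = [solve_and_measure(n) for n in levels]
    l2_orders = observed_orders([r[1] for r in results])
    threshold = 1.8  # documented band for expected_order 2 at high risk
    for n, (h, l2, linf) in zip(levels, results):
        print(f"n={n:4d}  L2={l2:.3e}  Linf={linf:.3e}")
    print("observed L2 orders:", [round(p, 3) for p in l2_orders])
    print("passes screening:", min(l2_orders) >= threshold)
```

For a real solver the same pattern applies: record both error norms at every level, compute the observed order between successive levels, and compare the smallest value against accept_observed_order_min.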
Before trusting a result that used this planner, record concrete evidence for each item:
- Ran benchmark_mms_planner.py --json and saved the inputs block, confirming the echoed dimension, expected_order, reference, and risk match the actual run (a fallback to effective_model: general was intentional, not a typo in --model).
- Followed refinement_protocol: used the reported levels (3, or 4 for high risk) of systematically refined grids at spacing_ratio 2, and recorded the observed order of accuracy from those runs.
- Compared the observed order against accept_observed_order_min; if below, logged the investigation (mesh not yet asymptotic, boundary/source errors, limiter activation) rather than treating the result as passed.
- If include_time_refinement is true, ran a separate time-step refinement study and recorded the temporal observed order, not just the spatial one.
- If mms_plan.manufacture_solution is true, derived the symbolic source/forcing term, applied the matching boundary terms, and recorded the L2 and Linf error norms versus the manufactured solution.
- Backed each acceptance_criteria item with a number: conservation/balance closes within a documented tolerance, the quantity of interest plateaus under refinement, and any benchmark discrepancy from benchmark_cases is explained before production use.
- Treated warnings as blockers for high-risk claims and recorded how each was resolved (e.g. an independent analytic/published reference was added when reference was none or experimental).

| Tempting shortcut | Why it's wrong / what to do |
|---|---|
| "The planner ran and printed a plan, so the result is verified." | The script only plans V&V; it never runs the solver. Verification comes from executing the refinement_protocol, MMS, and acceptance_criteria, not from generating the plan. |
| "Two grids converged, so the observed order is fine." | refinement_protocol.levels is 3 (4 for high risk) for a reason: you need >=3 systematically refined grids to estimate observed order and confirm the solution is in the asymptotic range before quoting it. |
| "Observed order beats accept_observed_order_min, so it's certified." | That threshold is an engineering screening heuristic (formal order minus a 10%/20% relative tolerance, floored at 1.0), not a certified bound. For rigorous order verification run a Richardson/GCI study. |
| "Steady-looking model, so I can skip include_time_refinement." | If the planner set include_time_refinement: true (any time-dependent or general fallback family), spatial refinement alone hides temporal error; run the time-step study too. |
| "We matched a benchmark, so the code is validated." | Matching benchmark_cases or converging shows the code approaches some solution; it does not prove the physical model is correct. Validation needs an independent reference plus model-error separation, not convergence alone. |
| "reference none is fine, the runs look physical." | With reference: none the strategy is verification-only; warnings says so. You may report convergence and conservation but must NOT call the result validated. |
| "Unknown model name, so I'll ignore the general fallback." | An unrecognized --model silently resolves to effective_model: general; confirm that fallback is intended, since it changes both benchmark_cases and the time-refinement decision. |
All inputs are command-line arguments parsed by argparse; validation happens in plan_vv (and partly in the parser). Any rejected input causes the script to print the error to stderr and exit with code 2.
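The input rules this section lists can be pictured with a short sketch. It is an illustration of the documented behavior under stated assumptions, not the script's actual code: validate_inputs and main are hypothetical names, and only MAX_FIELD_LEN, the allowlists, and exit code 2 come from the text.

```python
import math
import sys

MAX_FIELD_LEN = 256  # cap named in the text
RISKS = ("low", "medium", "high")
REFERENCES = ("analytic", "benchmark", "experimental", "none")


def validate_inputs(model, quantity, dimension, expected_order, reference, risk):
    """Raise ValueError for any input the documented rules would reject."""
    if dimension not in (1, 2, 3):
        raise ValueError("dimension must be 1, 2, or 3")
    # isfinite rejects NaN and infinity; the comparison rejects zero and negatives.
    if not math.isfinite(expected_order) or expected_order <= 0:
        raise ValueError("expected_order must be a positive, finite number")
    if risk not in RISKS:
        raise ValueError(f"risk must be one of {RISKS}")
    if reference not in REFERENCES:
        raise ValueError(f"reference must be one of {REFERENCES}")
    for name, value in (("model", model), ("quantity", quantity)):
        if len(value) > MAX_FIELD_LEN:
            raise ValueError(f"{name} exceeds {MAX_FIELD_LEN} characters")
    # Note: an unrecognized model is deliberately not rejected here, matching
    # the documented silent fallback to the general family.


def main(values):
    """Return the process exit status: 0 on success, 2 on rejected input."""
    try:
        validate_inputs(**values)
    except ValueError as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 2
    return 0
```

A caller would pass the result of main to sys.exit so that a rejected input surfaces as exit code 2 with the message on stderr.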
- dimension must be exactly 1, 2, or 3; any other integer is rejected.
- expected_order must be a positive, finite number (NaN, infinity, zero, and negatives are rejected).
- risk must be one of the allowlisted values low, medium, high (enforced both as an argparse choice and re-checked in plan_vv).
- reference must be one of the allowlisted values analytic, benchmark, experimental, none (enforced both as an argparse choice and re-checked in plan_vv).
- model and quantity are capped at 256 characters (MAX_FIELD_LEN); longer strings are rejected.
- The content of model and quantity is otherwise not allowlisted or sanitized: quantity is echoed verbatim into the output, and an unrecognized model family is silently resolved to general rather than rejected.
- [...] model and quantity; numeric outputs are bounded by the validated inputs.

The frontmatter declares allowed-tools: Read, Bash, Write, Grep, Glob.
- Bash is used solely to run the bundled scripts/benchmark_mms_planner.py (e.g. the python3 ... --json invocation in the Workflow).
- Read, Grep, and Glob are for inspecting the skill's own files and references (e.g. references/vv_patterns.md) when planning.
- Write supports turning the returned protocol into test stubs or checklist files; the planner script itself never writes.

- No eval, exec, or dynamic code execution; the planner is pure Python computing a dictionary.
- No pickle or other deserialization of untrusted data is performed; output is serialized with json.dumps.

- See references/vv_patterns.md for MMS, benchmark, and uncertainty planning notes.

- [...] refinement_protocol, mms_plan, acceptance_criteria, and warnings) and a "Common pitfalls & rationalizations" table that pins down domain-specific V&V shortcuts (plan != verification, >=3 grids for observed order, screening band is not a certified bound, time refinement, convergence != validation, reference none, general fallback).
- [...] script_checks that pin the planner's specific output (resolved verification_strategy, the relative accept_observed_order_min band, refinement levels, include_time_refinement, uncertainty_plan flags, model-specific benchmark cases, and the exact warning strings) for each of the three cases.
- [...] general once so benchmark selection and the time-refinement decision agree (transient unlisted PDEs no longer skip time refinement); echo the resolved family as effective_model. Replace the fixed absolute observed-order offset with a relative tolerance floored at first order. Document uncertainty_plan,
effective_model, and the acceptance heuristic. Add 256-character caps on string inputs.