From medsci-project
Fingerprints datasets with content-hash manifests for reproducibility. Verifies later copies against the manifest to detect schema, row-count, or value drift.
Slash command
/medsci-project:version-dataset
You help a medical researcher put a dataset under version control: fingerprint it, detect when it changes, and lock a reproducible version. This guards the data-integrity rule — an analysis must run on the data it claims to, with a fixed seed — by making any drift between runs loud instead of silent.
A dataset is an input to a result; if it changes silently, every downstream number is suspect. This skill records a deterministic fingerprint (file SHA-256 +, for tabular files, schema and per-column value hashes) so a later run can prove the inputs are unchanged. It does not alter data, and it records nothing non-deterministic (no timestamps unless explicitly passed), so the same data always yields the same manifest.
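The file-level part of such a fingerprint can be sketched with the standard library alone. This is a minimal illustration, not the script's actual code: the function names and the entry fields (`sha256`, `bytes`, `seed`, `provenance`) are assumptions, and the real layout is the one described in manifest_schema.md.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Stream the file through SHA-256 so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_entry(path, seed=None, provenance=None):
    """Hypothetical manifest entry: deterministic fields only, no timestamps."""
    entry = {"sha256": file_sha256(path), "bytes": Path(path).stat().st_size}
    if seed is not None:
        entry["seed"] = seed
    if provenance is not None:
        entry["provenance"] = provenance
    return entry


def dump_manifest(entries):
    """Serialize with sorted keys so identical inputs give identical text."""
    return json.dumps(entries, sort_keys=True, indent=2)
```

Because nothing in the entry depends on the clock or on dictionary ordering, building it twice from the same file gives byte-identical JSON.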
${CLAUDE_SKILL_DIR}/references/manifest_schema.md: the manifest.json structure, what each drift category means, and the non-deterministic-artifact policy (PPTX/DOCX timestamps). Read before interpreting drift.
# Build a manifest (record the analysis seed + provenance)
python "${CLAUDE_SKILL_DIR}/scripts/version_dataset.py" manifest data.csv \
--out manifest.json --seed 42 --provenance "KNHANES 2018 extract v1"
# Verify a later copy against it (CI / pre-analysis gate)
python "${CLAUDE_SKILL_DIR}/scripts/version_dataset.py" verify --manifest manifest.json --strict
# Compare two manifests (what changed between versions)
python "${CLAUDE_SKILL_DIR}/scripts/version_dataset.py" diff --old v1.json --new v2.json
File hashing is stdlib-only; tabular schema/column hashing uses pandas when present.
--ignore-cols excludes volatile columns; --base makes manifest keys relative.
Build the manifest at the moment the dataset is frozen for analysis. Gate: confirm with the user the seed and provenance note are correct before locking — the manifest is the record they will cite as "this is the data the results came from."
Before re-running an analysis (or in CI), verify --strict. Gate: if drift is
reported, stop and show the user the drift report; do not proceed on changed data
without their explicit acknowledgement and a re-lock. Silent re-run on drifted data
is the failure this skill exists to prevent.
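The gate above could be wired into an analysis entry point roughly as follows. This is a sketch under assumptions: `verify_gate` and `run_analysis_if_unchanged` are hypothetical helper names, and the only behavior taken from this page is that verify exits 0 on a match and non-zero on drift.

```python
import subprocess
import sys


def verify_gate(command):
    """Run a verify command and return (ok, report); ok means exit code 0."""
    result = subprocess.run(command, capture_output=True, text=True)
    report = result.stdout + result.stderr
    return result.returncode == 0, report


def run_analysis_if_unchanged(command, analysis):
    """Call analysis() only when the gate passes; otherwise show the report and stop."""
    ok, report = verify_gate(command)
    if not ok:
        print(report)
        raise SystemExit("Dataset drift detected; analysis not run.")
    return analysis()
```

In practice `command` would be a list such as `[sys.executable, script_path, "verify", "--manifest", "manifest.json", "--strict"]`, with `script_path` pointing at version_dataset.py inside the skill directory. Stopping with `SystemExit` keeps a drifted run from producing numbers that look valid.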
When a dataset is intentionally updated, diff the old and new manifests and
present the change set (added/removed/changed columns, row-count delta) so the
user can record what changed and re-lock. Gate: the user approves the new
version before it replaces the locked one.
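The change set described above can be illustrated with a small comparison function. The record shape used here (`rows` plus a `columns` mapping of column name to value hash) is a hypothetical stand-in for the real manifest layout, which is defined in manifest_schema.md, and the hash strings are placeholders.

```python
def diff_tables(old, new):
    """Compare two hypothetical table records: {"rows": int, "columns": {name: hash}}."""
    old_cols, new_cols = old["columns"], new["columns"]
    shared = set(old_cols) & set(new_cols)
    return {
        "added": sorted(set(new_cols) - set(old_cols)),
        "removed": sorted(set(old_cols) - set(new_cols)),
        "changed": sorted(name for name in shared if old_cols[name] != new_cols[name]),
        "row_delta": new["rows"] - old["rows"],
    }


# Placeholder hashes for illustration only.
v1 = {"rows": 3457, "columns": {"id": "a1", "bmi": "b1", "hba1c": "c1"}}
v2 = {"rows": 3460, "columns": {"id": "a1", "bmi": "b2", "ldl": "d1"}}
change_set = diff_tables(v1, v2)
```

Here `change_set` reports `ldl` as added, `hba1c` as removed, `bmi` as changed, and a row delta of 3, which is the kind of summary the user would review before approving a re-lock.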
Some outputs (PPTX/DOCX with embedded timestamps, figures with render metadata)
change byte-for-byte on every build even when the analysis is identical. Do not
put these under strict byte verification — manifest only the deterministic inputs
and tabular outputs (data files, result CSVs), or use --ignore-cols for volatile
columns. See references for the policy.
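Why byte verification fails for these outputs can be shown with a ZIP container, which is the packaging DOCX and PPTX files use. The sketch below is only a demonstration of the problem, not something the skill is stated to do; `build_archive` and `member_digest` are hypothetical names.

```python
import hashlib
import io
import zipfile


def build_archive(payload, date_time):
    """Write one member into an in-memory ZIP with an explicit modification time."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        info = zipfile.ZipInfo("word/document.xml", date_time=date_time)
        archive.writestr(info, payload)
    return buffer.getvalue()


def member_digest(archive_bytes):
    """Hash member names and contents only, ignoring container metadata."""
    digest = hashlib.sha256()
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for name in sorted(archive.namelist()):
            digest.update(name.encode("utf-8") + b"\0")
            digest.update(archive.read(name))
    return digest.hexdigest()


first = build_archive(b"<w:document/>", (2026, 5, 1, 9, 0, 0))
second = build_archive(b"<w:document/>", (2026, 5, 2, 9, 0, 0))
bytes_differ = hashlib.sha256(first).digest() != hashlib.sha256(second).digest()
content_same = member_digest(first) == member_digest(second)
```

Both flags come out true: the two builds carry identical content, yet their raw bytes differ because the timestamp is stored in the container, so a strict byte check would report drift on every build.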
/clean-data, /generate-codebook, /deidentify.
demo/*/ carries a manifest.lock.json (input data + deterministic result tables) that verify --strict checks.
Lock a freshly-frozen extract:
python "${CLAUDE_SKILL_DIR}/scripts/version_dataset.py" manifest cohort.csv \
--out manifest.json --seed 42 --provenance "KNHANES 2018 extract, frozen 2026-05"
# -> {"files": 1, "out": "manifest.json"}
Before re-running the analysis next month:
python "${CLAUDE_SKILL_DIR}/scripts/version_dataset.py" verify --manifest manifest.json --strict
# OK: 1 file(s) match the manifest. (exit 0 — safe to run)
If someone silently re-exported the data with three extra rows:
=========================================
Dataset Manifest Verify
=========================================
DRIFT (3):
ROW COUNT cohort.csv: 3457 -> 3460
CHANGED column cohort.csv:bmi
CHANGED column cohort.csv:hba1c
MANIFEST_DRIFT: dataset differs from manifest. (exit 1 — STOP)
The analysis does not proceed: the result the manuscript will cite would no
longer match the locked data. The researcher reviews the drift, decides whether
the change is intended, and only then re-locks (manifest again) and records the
new provenance. A tabular file is compared on its logical content (schema +
per-column value hashes), not raw bytes — re-saving the same data, reordering
columns, or an --ignore-cols volatile timestamp column does not trip a false drift.
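A comparison on logical content can be sketched as one hash per column, keyed by column name. This is an illustration under assumptions rather than the script's implementation (which uses pandas when present): `column_hashes` is a hypothetical name, values are hashed as the raw CSV strings, and the `ignore_cols` parameter only mimics the idea behind --ignore-cols.

```python
import csv
import hashlib
import io


def column_hashes(csv_text, ignore_cols=()):
    """Return (row_count, {column: sha256 of its values in row order})."""
    reader = csv.DictReader(io.StringIO(csv_text))
    names = [n for n in (reader.fieldnames or []) if n not in ignore_cols]
    digests = {name: hashlib.sha256() for name in names}
    rows = 0
    for row in reader:
        rows += 1
        for name in names:
            value = row[name] or ""
            # A separator byte keeps ("ab", "c") distinct from ("a", "bc").
            digests[name].update(value.encode("utf-8") + b"\x1f")
    return rows, {name: d.hexdigest() for name, d in digests.items()}


original = "id,bmi\n1,22.5\n2,30.1\n"
reordered = "bmi,id\n22.5,1\n30.1,2\n"
stamped = "id,bmi,exported_at\n1,22.5,2026-05-01\n2,30.1,2026-05-01\n"
extra_row = "id,bmi\n1,22.5\n2,30.1\n3,27.0\n"
```

Since the result is a mapping keyed by column name, `column_hashes(original)` equals `column_hashes(reordered)`, and `column_hashes(stamped, ignore_cols=("exported_at",))` matches as well. The `extra_row` text differs in both its row count and its column hashes.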
verify.
provenance note is user-supplied text.
npx claudepluginhub aperivue/medsci-skills --plugin medsci-presentation