From medsci-project
Generates a citable data dictionary/codebook from tabular datasets (CSV/TSV/Excel/Parquet/Stata/SAS). Profiles each variable's role, type, missingness, and distributions, flagging coded values as [NEEDS DICTIONARY].
Slash command: /medsci-project:generate-codebook
You help a medical researcher turn a raw tabular dataset into a structured,
citable data dictionary (codebook). This is the generator side of the
dictionary-first workflow: it produces the artifact that /define-variables and
dictionary-first QC later consume. You generate code and review output — you do
not invent the meaning of coded values.
A codebook describes what is in the data, not what the codes mean. Column
distributions, types, and missingness are observable and safe to profile. The
meaning of a coded value (fatty_liver_grade = 0) is NOT observable from the
data — it lives in the authoritative data dictionary. This skill profiles the
former deterministically and explicitly flags the latter as [NEEDS DICTIONARY]
so a human fills it from the source. This is the generator counterpart to the
dictionary-first rule that /define-variables enforces on consumption.
${CLAUDE_SKILL_DIR}/references/codebook_schema.md: the
codebook.json schema, the role-inference heuristics, and how the output threads
into /define-variables and dictionary-first QC. Read this before interpreting output.
Run the bundled profiler rather than describing columns from memory:
python "${CLAUDE_SKILL_DIR}/scripts/generate_codebook.py" data.csv --out-dir .
Supports .csv/.tsv/.xlsx/.parquet/.dta/.sas7bdat. Flags: --max-levels N
(categorical cutoff, default 20), --json-only, --md-only. The script is
pandas-only, runs locally, and never sends data anywhere.
Run generate_codebook.py on the dataset. It writes codebook.json (machine-
readable) and codebook.md (review table), reporting per variable: role
(id / continuous / categorical / binary / date / text), dtype, missingness,
unique count, level frequencies or quantile summary, and a needs_dictionary flag.
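The per-variable report described above can be approximated in a few lines. This is a rough sketch, not the logic of the bundled generate_codebook.py: `profile_column`, its thresholds, and the sample rows are all hypothetical. It uses only the standard library to stay self-contained (the real script is pandas-based) and ignores the id and date roles. It assumes "bare codes" means a low-cardinality column whose levels all parse as integers.

```python
import csv
import io

def _all_parse(values, caster):
    """True when the list is non-empty and every value converts with caster."""
    try:
        for v in values:
            caster(v)
    except ValueError:
        return False
    return bool(values)

def profile_column(values, max_levels=20):
    """Profile one column of raw strings (hypothetical heuristic, not the real script)."""
    present = [v for v in values if v.strip()]
    levels = sorted(set(present))
    if len(levels) == 2:
        role = "binary"
    elif len(levels) <= max_levels:
        role = "categorical"
    elif _all_parse(present, float):
        role = "continuous"
    else:
        role = "text"
    # Assumption: integer-only levels are codes whose meaning is not observable.
    coded = role in ("binary", "categorical") and _all_parse(present, int)
    missing = len(values) - len(present)
    return {
        "role": role,
        "missing_pct": round(100 * missing / len(values), 1) if values else 0.0,
        "unique": len(levels),
        "needs_dictionary": coded,
    }

# Made-up rows for illustration only.
SAMPLE = """age,sex,fatty_liver_grade,smoking_status
54,1,0,never
61,2,2,former
47,1,1,never
70,2,0,current
58,1,,former
66,2,2,never
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
profile = {
    name: profile_column([r[name] for r in rows], max_levels=5)
    for name in rows[0]
}
for name, stats in profile.items():
    print(name, stats)
```

The sketch passes `max_levels=5` only because the sample is tiny. With the default of 20, the six distinct ages would be labelled categorical, which is the kind of mis-inferred role the confirmation gate is there to catch.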
Present codebook.md and walk the user through it. Gate: the user confirms
the inferred roles (e.g., an integer-coded scale mis-read as continuous, or an id
column). Do not proceed to definition work until the user approves the role
assignments.
For every variable flagged needs_dictionary: true, the level codes are
uninterpretable without the authoritative source. Gate: ask the user to
supply the meaning of each code from the real data dictionary (file/sheet/row),
or to confirm none exists. Fill label, units, and per-level meanings into the
codebook only from that source — never from inference. If the user cannot
supply it, leave the [NEEDS DICTIONARY] marker in place; do not erase it.
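One way to picture the "only from the source" rule is a fill step that changes nothing unless it receives both the meanings and a citation. This is a sketch under assumptions: the entry layout (`levels`, `needs_dictionary`, `source`), the helper `fill_from_dictionary`, and the citation string are hypothetical and are not taken from codebook_schema.md.

```python
NEEDS_DICTIONARY = "[NEEDS DICTIONARY]"

def fill_from_dictionary(entry, source_levels, citation):
    """Return a copy of entry with meanings filled only from a cited source.

    entry layout is an assumption for illustration, not the real schema.
    """
    updated = {**entry, "levels": dict(entry["levels"])}
    if not source_levels or not citation:
        # Nothing authoritative was supplied: the marker stays in place.
        return updated
    for code in updated["levels"]:
        if code in source_levels:
            updated["levels"][code] = source_levels[code]
    # Any code the source did not cover keeps its marker and keeps the flag raised.
    updated["needs_dictionary"] = NEEDS_DICTIONARY in updated["levels"].values()
    updated["source"] = citation
    return updated

entry = {
    "name": "sex",
    "needs_dictionary": True,
    "levels": {"1": NEEDS_DICTIONARY, "2": NEEDS_DICTIONARY},
}

untouched = fill_from_dictionary(entry, None, None)
partial = fill_from_dictionary(entry, {"1": "male"}, "hypothetical.xlsx > sheet > row")
filled = fill_from_dictionary(
    entry, {"1": "male", "2": "female"}, "hypothetical.xlsx > sheet > row"
)
print(untouched["needs_dictionary"], partial["needs_dictionary"], filled["needs_dictionary"])
```

The function has no default or fallback labels, so there is no code path through which a "usual coding" guess could enter the codebook.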
The completed codebook.json becomes the input dictionary for /define-variables
(operationalization) and the citation source for dictionary-first QC. Gate:
confirm with the user that no needs_dictionary flags remain unresolved before
the codebook is treated as authoritative for downstream analysis.
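The final gate can be checked mechanically before asking the user to confirm. The snippet below is a minimal sketch that assumes codebook.json holds a top-level `variables` list whose items carry `name` and `needs_dictionary`. That layout and the helper `unresolved_variables` are guesses for illustration, so check them against codebook_schema.md before relying on them.

```python
import json

def unresolved_variables(codebook):
    """Names of variables whose needs_dictionary flag is still set."""
    return [v["name"] for v in codebook["variables"] if v.get("needs_dictionary")]

# Inline stand-in for a partly completed codebook.json (assumed layout).
codebook = json.loads("""
{
  "variables": [
    {"name": "patient_id", "needs_dictionary": false},
    {"name": "sex", "needs_dictionary": false},
    {"name": "fatty_liver_grade", "needs_dictionary": true}
  ]
}
""")

pending = unresolved_variables(codebook)
if pending:
    print("Not yet authoritative; unresolved:", ", ".join(pending))
else:
    print("No unresolved needs_dictionary flags.")
```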
.dta), SAS (.sas7bdat).
[NEEDS DICTIONARY]).
/clean-data.
/deidentify before sharing.
/define-variables (this skill feeds it).
codebook.json as its data dictionary input.
codebook.json (schema in references) and codebook.md (review table with a
"Columns requiring dictionary lookup" section). Summarize the counts
(rows, columns, needs_dictionary_count) in chat; do not paste the full JSON.
Input cohort.csv:
patient_id,age,sex,fatty_liver_grade,smoking_status,visit_date
1001,54,1,0,never,2023-01-15
1002,61,2,2,former,2023-02-03
Run:
python "${CLAUDE_SKILL_DIR}/scripts/generate_codebook.py" cohort.csv --out-dir .
# -> {"n_rows": ..., "n_columns": 6, "needs_dictionary_count": 2, "outputs": [...]}
codebook.md (excerpt):
| Variable | Role | Missing % | Unique | Needs dictionary |
|---|---|---|---|---|
| `patient_id` | id | 0.0 | N | |
| `age` | continuous | 0.0 | ... | |
| `sex` | binary | 0.0 | 2 | ⚠️ YES |
| `fatty_liver_grade` | categorical | 0.0 | 5 | ⚠️ YES |
| `smoking_status` | categorical | 0.0 | 3 | |
| `visit_date` | date | 0.0 | ... | |
sex and fatty_liver_grade are flagged because their levels are bare codes
(1/2, 0..4). smoking_status is not flagged — its levels are already
human-readable. The reviewer then:
Enters sex: 1 = male, 2 = female and fatty_liver_grade: 0 = none … 4 = suspected
into the codebook from the authoritative data dictionary (citing file > sheet > row).
Confirms that no [NEEDS DICTIONARY] flags remain, then hands codebook.json to
/define-variables.
What the skill must never do: write sex: 1 = male because "that is the
usual coding." If the dictionary is unavailable, the flag stays.
[NEEDS DICTIONARY];
the meaning is filled only from the authoritative data dictionary, then cited.