By probabl-ai
Accelerate ML experimentation by automating the full iteration loop: from experiment proposals and project scaffolding to pipeline declarations, model evaluation, audit digest generation, and test-driven validation. Includes environment bootstrapping and code quality enforcement.
Owns the `audit/` folder: one `# %%` (jupytext percent) Python file per experiment, aligned 1:1 with `experiments/NN_<short_name>.py` and `journal/NN_<short_name>.md`, that loads the experiment's skore report **read-only** and uses bare-last-expression cells whose `__repr__` carries the audit's signal. The agent executes the audit file via the bundled in-process runner (`audit-ml-pipeline/scripts/run_cells.py` — IPython `InteractiveShell.run_cell`), which streams a markdown digest of each cell's stdout + last-expression repr to stdout (optionally also to a file). The digest fuels narrative work (the `JOURNAL.md` Status + History update, follow-up questions about a past experiment, cross-experiment comparison). Stops at "audit/NN_*.py is placed, executed, and the digest is available." Never calls `skore.evaluate(...)` or `project.put(...)`. TRIGGER — any of: - `iterate-ml-experiment` § 4 record-outcome — audit is dispatched FIRST (replaces scratch probes for metric extraction). - The user asks "audit experiment 02", "show me what 03 looks like", "re-audit 04 against the new report". - An experiment was re-run (same `put()` key overwritten) and the matching audit file needs re-execution. - The user wants a human-readable narrative of a past experiment without firing the full `iterate-from-skore` flow. SKIP when: the design note isn't approved yet (route to `iterate-ml-experiment`); the experiment hasn't been run (no report on disk); the agent feature isn't installed (delegate to `python-env-manager` § "Agent feature"); the user is mining the report to source the *next* experiment (`iterate-from-skore`); the user wants to explore the **raw dataset** rather than a finished run's skore report (`explore-ml-data` — audit reads a report, not the data). 
HOW TO USE: confirm the four-way stem pairing exists (`journal/NN_*.md` approved + `experiments/NN_*.py` exists + smoke test passed + report under that key in the Project), then place `audit/NN_<short_name>.py` from `templates/audit.py`, substituting the package name + the literal Project init block copied from `experiments/<stem>.py`. Execute via the bundled runner: `pixi run -e agent python .agents/skills/audit-ml-pipeline/scripts/run_cells.py audit/<stem>.py`. **Read the Stop conditions and emit the Pre-flight checklist before any write or shell command.** Always invoke `python-api` for skore symbol signatures — never write them from memory.
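To make the "bare-last-expression cell plus streamed digest" idea concrete, here is a minimal sketch of what a percent-format runner could look like. It is not the bundled `run_cells.py` (which the description says relies on IPython's `InteractiveShell.run_cell`). It uses only the standard library, and the names `split_cells`, `run_cell`, and `digest`, as well as the digest layout, are assumptions made for illustration.

```python
import ast
import contextlib
import io
from typing import Optional


def split_cells(source: str) -> list:
    """Split jupytext percent-format source on lines starting with '# %%'."""
    cells, current = [], []
    for line in source.splitlines():
        if line.startswith("# %%"):
            if any(part.strip() for part in current):
                cells.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if any(part.strip() for part in current):
        cells.append("\n".join(current))
    return cells


def run_cell(code: str, namespace: dict) -> tuple:
    """Run one cell; return (captured stdout, repr of a bare last expression)."""
    tree = ast.parse(code)
    last_expr = None
    # A trailing expression statement is evaluated separately so its value is kept.
    if tree.body and isinstance(tree.body[-1], ast.Expr):
        last_expr = ast.Expression(tree.body.pop().value)
    buffer = io.StringIO()
    value = None
    with contextlib.redirect_stdout(buffer):
        exec(compile(tree, "<cell>", "exec"), namespace)
        if last_expr is not None:
            value = eval(compile(last_expr, "<cell>", "eval"), namespace)
    last_repr: Optional[str] = None if value is None else repr(value)
    return buffer.getvalue(), last_repr


def digest(source: str) -> str:
    """Build a markdown digest: one section per cell with stdout and last repr."""
    namespace: dict = {}
    sections = []
    for index, code in enumerate(split_cells(source), start=1):
        stdout, last_repr = run_cell(code, namespace)
        lines = [f"### Cell {index}"]
        if stdout:
            lines.append("stdout:\n" + stdout.rstrip())
        if last_repr is not None:
            lines.append("repr:\n" + last_repr)
        sections.append("\n".join(lines))
    return "\n\n".join(sections)


# Toy audit file: the second cell ends with a bare expression.
example = '''# %%
values = [1, 2, 3]
print("loaded")
# %%
sum(values)
'''
report = digest(example)
print(report)
```

In this sketch the cells share one namespace, so a later cell can read what an earlier cell defined, which mirrors how a notebook-style audit file is expected to behave.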
Declare the pipeline from data source to predictor as a **skrub DataOps graph** (not as a bare `sklearn.Pipeline`). Every step is either a pure-Python function (stateless) attached via `.skb.apply_func`, or a sklearn-compatible estimator (stateful) attached via `.skb.apply`. Stops at the declared object — no fit, split, tuning, persistence, or evaluation. TRIGGER — any of: - Writing or editing code that declares any link in the chain *data source → predictor*: loaders, preprocessing, encoders / imputers / scalers, feature steps, composition objects (`Pipeline`, `ColumnTransformer`, skrub `tabular_pipeline`, `nn.Module`), or the final estimator. - A pure-Python data-processing function destined for the pipeline path (cleans / derives / reshapes) — whether wrapped via `FunctionTransformer`, `skrub.@deferred` / `skrub.var`, a custom `BaseEstimator` subclass, or just called in the training path before the estimator. - A step is added, removed, swapped, or reordered inside an existing pipeline declaration. - A bare `sklearn.Pipeline` / `make_pipeline` is being used as the top-level — fire to redirect into a skrub DataOps graph. - The user asks to build / declare / set up a pipeline / classifier / regressor for X. SKIP when: `.fit(...)` calls / training loops / `Trainer.fit` / epoch loops; train/test split or cross-validation splitting; hyperparameter search; persistence (`joblib.dump`, checkpointing); evaluation / metrics / scoring; inference over a pre-trained model; pure EDA; library-choice questions with no concrete declaration in play. HOW TO USE: consult before the first declarative line and on every structural edit (added/swapped step, changed input columns, changed estimator family). Don't re-consult for cosmetic edits. **First, read the Stop conditions and emit the Pre-flight checklist as visible text before any code.** Always invoke `python-api` to confirm skrub / sklearn symbol names and signatures before typing — don't guess from memory.
Opinionated Python stack for data-science / ML work — one library per job, organized into tiers (mandatory / user choice / optional / transitive). SKILL.md is the index; per-library `references/<library>.md` files carry scope, "pick this when" / "pick something else when", and pairings. TRIGGER when (any of these): (1) **a library import fails** in this stack's domain — the answer is install, not substitute (see § "Missing dependency"); (2) **a library choice has to be made** — explicitly (the user asks "which library for X?") or implicitly (code is about to introduce a new dependency, or the project is being scaffolded and the tabular library hasn't been picked yet); (3) starting a new Python data-science / ML project; (4) the user or current code reaches for a substitute outside the stack (xgboost, lightgbm, black, isort, flake8, poetry, hatch), or reaches for `mlflow` to log params/metrics, or for `cross_val_score` + handwritten reporting — redirect: tracking → `skore` Project API, evaluation / reporting → `skore` report classes, `mlflow` stays only for model serving / registry. SKIP when: the project is non-Python; the work is web / backend / infra unrelated to data science; the library is already chosen and installed and the task is implementation inside it (bug fix, feature work, refactor) with no new dependency in play. HOW TO USE: **read this SKILL.md end-to-end before recommending or installing anything** — picking from a single index entry hides the tier (whether the library is mandatory, a user-choice, optional, or already transitively present) and the pairings, and both matter. Then read the linked `references/<library>.md` for the chosen library's scope and tradeoffs. Don't silently substitute one library for another; if no entry fits, surface the gap to the user.
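As an illustration of trigger (4), the sketch below scans a Python source string for imports that the description lists as outside the stack or as needing a redirect. The skill itself is a set of instructions for the agent, not a script, so `ADVICE`, `imported_packages`, and `flag_out_of_stack` are hypothetical names, and the advice strings only paraphrase the text above.

```python
import ast

# Hypothetical lookup table paraphrasing the trigger list in the description.
ADVICE = {
    "xgboost": "outside the stack: do not substitute silently, check SKILL.md",
    "lightgbm": "outside the stack: do not substitute silently, check SKILL.md",
    "mlflow": "tracking goes to the skore Project API; mlflow stays for serving / registry",
}


def imported_packages(source: str) -> set:
    """Return the top-level package names imported anywhere in the source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    return names


def flag_out_of_stack(source: str) -> dict:
    """Map each flagged package found in the source to its redirect advice."""
    found = imported_packages(source)
    return {name: ADVICE[name] for name in sorted(found) if name in ADVICE}


snippet = """
import numpy as np
import xgboost as xgb
from mlflow import log_metric
"""
flags = flag_out_of_stack(snippet)
for package, advice in flags.items():
    print(f"{package}: {advice}")
```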
Methodology for evaluating a single sklearn-compatible learner (in particular, the `SkrubLearner` produced by `build-ml-pipeline`). Owns: which entry point to call (`skore.evaluate` first, the explicit report classes when needed), which cross-validator to pick from scikit-learn's catalogue, how to consume the structural metadata (`groups`, `times`, …) attached at build time via `.skb.mark_as_X(split_kwargs=...)`. Stops at "what does the report say". Defaults (metrics, plots) come from skore; only override on explicit user request. TRIGGER when: code calls `cross_val_score`, `cross_validate`, `classification_report`, or any handwritten metric print (`print(mean_squared_error(...))`); code calls `.skb.cross_validate(...)` (route through skore for richer output); user asks how to score, evaluate, or compare a single learner; user asks how to pick a cross-validator; user wants to see a report / metrics / diagnostic plots for a fitted learner. SKIP when: declaring the pipeline (use `build-ml-pipeline`); hyperparameter / model search (separate skill); fitting, persisting, or serving the final model; tracking or comparing experiments across multiple runs over time (separate skill). HOW TO USE: invoke before any evaluation call. **First, read the "Stop conditions" block at the top of the body and emit the Pre-flight checklist as visible text in your response — both are mandatory before any evaluation code is written.** The structural facts about the data (group keys, time ordering) should already be encoded at the X marker via `split_kwargs` — if they aren't and you can't tell from the data, return to `build-ml-pipeline` and ask the user. For symbol-level lookups, defer to `python-api` (skore symbols) and `python-api` (splitters); don't guess names from memory.
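The link between structural metadata and the choice of cross-validator can be sketched as follows. This assumes the metadata arrives as a plain dict with the `groups` and `times` keys named above. The helper `pick_splitter`, its precedence (groups before times), and its defaults are assumptions, not the skill's actual decision rules. The splitter classes themselves are standard scikit-learn ones.

```python
from sklearn.model_selection import GroupKFold, KFold, TimeSeriesSplit


def pick_splitter(split_kwargs: dict, n_splits: int = 5):
    """Hypothetical helper: choose a scikit-learn splitter from structural metadata."""
    if split_kwargs.get("groups") is not None:
        # Rows sharing a group key must never straddle train and test.
        return GroupKFold(n_splits=n_splits)
    if split_kwargs.get("times") is not None:
        # Time-ordered data: train on the past, test on the future.
        return TimeSeriesSplit(n_splits=n_splits)
    return KFold(n_splits=n_splits, shuffle=True, random_state=0)


# Toy data: six rows, three groups of two rows each.
X = [[0], [1], [2], [3], [4], [5]]
y = [0, 1, 0, 1, 0, 1]
groups = [0, 0, 1, 1, 2, 2]

cv = pick_splitter({"groups": groups}, n_splits=3)
folds = list(cv.split(X, y, groups=groups))
for train_idx, test_idx in folds:
    train_groups = {groups[i] for i in train_idx}
    test_groups = {groups[i] for i in test_idx}
    print(sorted(train_groups), sorted(test_groups))
```

Each printed pair shows disjoint group sets, which is the property that motivates a group-aware splitter in the first place.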
Owns data understanding BEFORE any model is designed. Places and executes `data/eda.py` (a jupytext `# %%` script) via the shared in-process runner, reads the streamed digest, then writes a persisted `data/eda.md` report (plus linked `data/eda_<table>.html` skrub `TableReport` pages) and the `## Data understanding (EDA)` section of `journal/JOURNAL.md`. The point is to surface the dataset facts (shape, dtypes, missingness, cardinality, target balance / skew, datetime / group structure, feature associations) that JUSTIFY the later learner / splitter / metric decisions, so the user understands *why* the modelling choices are made. Uses `skrub.TableReport` for dataframe overviews and the shared runner `audit-ml-pipeline/scripts/run_cells.py`. Stops at "EDA executed, `data/eda.md` + HTML written, JOURNAL EDA section updated." Never designs the model, never edits `src/<pkg>/`, never modifies the user's raw data files. TRIGGER, any of: - `iterate-ml-experiment` § 0 bootstrap, BEFORE the baseline design note; the G-EDA gate fires here (run / skip). - The user asks to "explore the data", "do an EDA", "profile the dataset", "what does the data look like", "understand the data". - A new or changed data source needs (re-)understanding before the next experiment. SKIP when: the workspace isn't scaffolded / bootstrapped yet: `iterate-ml-experiment` § 0 owns bootstrap ordering and will dispatch here at the G-EDA step, so don't run standalone ahead of scaffolding (route to `iterate-ml-experiment` / `organize-ml-workspace`); there is no data to explore yet; the user wants to inspect a finished run's skore report rather than the raw dataset (`audit-ml-pipeline`); the user is past data understanding and wants pipeline / evaluation mechanics (`build-ml-pipeline` / `evaluate-ml-pipeline`); a pure symbol lookup (`python-api`); EDA is already recorded (`data/eda.md` + the JOURNAL EDA section exist) and the user is not asking to refresh it.
HOW TO USE: run the Detection step (does `data/eda.md` + the JOURNAL EDA section already exist?), emit the Pre-flight checklist as visible text, read the Stop conditions, then place `data/eda.py` from `templates/eda.py`, execute it via the shared runner, read the digest, and author `data/eda.md` + the JOURNAL EDA section. Always resolve skrub / pandas / polars symbols via `python-api`, never from memory.
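The Detection step boils down to two file checks, sketched below. The file paths and the section heading come from the description above. The function name `eda_already_recorded` and the exact matching rule (a line equal to the heading) are assumptions.

```python
import tempfile
from pathlib import Path

EDA_HEADING = "## Data understanding (EDA)"


def eda_already_recorded(workspace: Path) -> bool:
    """True when data/eda.md exists and JOURNAL.md contains the EDA heading."""
    report = workspace / "data" / "eda.md"
    journal = workspace / "journal" / "JOURNAL.md"
    if not report.is_file() or not journal.is_file():
        return False
    text = journal.read_text(encoding="utf-8")
    return any(line.strip() == EDA_HEADING for line in text.splitlines())


# Demo on a throwaway workspace: empty first, then with both artifacts present.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    before = eda_already_recorded(root)

    (root / "data").mkdir()
    (root / "journal").mkdir()
    (root / "data" / "eda.md").write_text("# EDA report\n", encoding="utf-8")
    (root / "journal" / "JOURNAL.md").write_text(
        "# Journal\n\n" + EDA_HEADING + "\n\nNotes go here.\n", encoding="utf-8"
    )
    after = eda_already_recorded(root)

print(before, after)
```

When the check returns true and the user is not asking for a refresh, the SKIP condition above applies and the EDA does not need to run again.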
A set of skills to partner with you throughout your machine learning experimentation journey.
The aim is to let you focus on the science while AI agents handle the implementation, guided by two important ingredients: great libraries for maintainability and sound methodologies for running experiments correctly.
In practice, from a prompt such as:
╭────────────────────────────────────────────────────────────────────────╮
│ > Given the context in the file `data/README.md` and the data located │
│ in `data/`, let's build a first machine learning pipeline that will │
│ serve as baseline for the next experiments that we are going to run │
│ together. │
╰────────────────────────────────────────────────────────────────────────╯
you can expect your agent to start experimenting with you. The skills work well with models such as Claude Opus and Sonnet and give great results with smaller models such as Qwen 3.6 30B or DeepSeek v4 Flash. As for agent harnesses, we tested them with Claude Code, OpenCode, Cursor, and GitHub Copilot and found no significant difference in terms of skill invocation.
You can install the skills with the skore CLI, which is available from PyPI and from conda-forge.
First, install skore-cli:

    # with pip
    pip install skore-cli
    # with uv
    uv tool install skore-cli
    # with pixi
    pixi global install skore-cli

Then run the following command:

    skore skills install

You can also use uvx or pixi exec to fetch the skore CLI and run the command directly in an isolated environment:

    uvx --from skore-cli skore skills install

or

    pixi exec --spec skore-cli skore skills install

If you prefer npx, you can use:

    npx skills add probabl-ai/skills

If you only use Claude Code and prefer the native plugin flow, this repo is also a Claude Code plugin marketplace:

    /plugin marketplace add probabl-ai/skills
    /plugin install probabl-skills@probabl-skills

`/plugin update` pulls new releases.

| Skill | Description |
|---|---|
| explore-ml-data | Explore the dataset before designing any model. |
| build-ml-pipeline | Build a machine learning pipeline from the data source to the learner, including multi-table engineering. |
| evaluate-ml-pipeline | Evaluate a complex machine learning pipeline and get structured reports including metrics, plots, and diagnostics. |
| test-ml-pipeline | Make sure that your machine learning pipeline is production-ready statistically and functionally. |
| smoke-test-ml-pipeline | Stress test your machine learning pipeline on future data to make sure it works. |
| audit-ml-pipeline | Once testing and the experiment are done, audit the run by loading its skore report and investigating it. |

| Skill | Description |
|---|---|
| iterate-ml-experiment | Design experiments, keep track of them, and iterate on them. |
| iterate-from-skore | Use skore to run diagnostics and checks that can be reported and addressed in the next experiment. |
| iterate-from-user | Stay in the loop as a user and propose new experiments from free text, a scientific article URL, or a resource link (GitHub issue / spec / reference repo). |

| Skill | Description |
|---|---|
| organize-ml-workspace | Keep an organized workspace to track your experiments. |
| python-code-style | Enforce good Python practices on your code out of the box. |
| python-env-manager | Bootstrap the experiment setup with your favorite Python environment manager. |
| data-science-python-stack | Opinionated one-library-per-job Python stack, organized into mandatory / user-choice / optional / transitive tiers. |