Selects and implements train/validation/test split strategies based on data characteristics like time, groups, imbalance, and size. Guides sklearn usage for model evaluation frameworks.
Slash command: /ds:split-strategy
Select the right train/validation/test split based on your data characteristics. Follow the decision tree below.
Is there a time component?
Yes -> Temporal split: Train on past, validate on recent, test on most recent. Never shuffle across time.
Use TimeSeriesSplit for cross-validation.
No -> Continue to next question.
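The "train on past, validate on recent, test on most recent" rule can be sketched as a positional cut on time-sorted rows. This is a minimal illustration, not a prescribed implementation: the column names, the toy data, and the 70/15/15 cut points are all assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 100 daily observations, deliberately shuffled
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=100, freq="D"),
    "feature": rng.normal(size=100),
    "target": rng.integers(0, 2, size=100),
}).sample(frac=1.0, random_state=0)

# Sort by time first; the split is by position, never by random sampling
df = df.sort_values("timestamp").reset_index(drop=True)

# Illustrative 70/15/15 cut points (integer arithmetic avoids float rounding)
n = len(df)
train_end = n * 70 // 100
val_end = n * 85 // 100

train = df.iloc[:train_end]        # oldest rows
val = df.iloc[train_end:val_end]   # more recent rows
test = df.iloc[val_end:]           # most recent rows
```

Every training timestamp precedes every validation timestamp, which in turn precedes every test timestamp, so no future information reaches the model during fitting.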
Are there groups in the data?
Examples: multiple rows per customer, multiple images per patient, repeated measurements.
Yes -> Group-aware split: Keep all observations of a group in the same fold.
Use GroupKFold or GroupShuffleSplit.
No -> Continue.
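For a single group-aware holdout, GroupShuffleSplit is one option. The sketch below uses made-up data (20 hypothetical customers with 5 rows each) and checks that no customer lands on both sides of the split.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 20 customers with 5 rows each
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(20), 5)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# test_size is a fraction of groups, not of rows
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=groups))

# Verify that the two sides share no customer
train_groups = set(groups[train_idx])
test_groups = set(groups[test_idx])
overlap = train_groups & test_groups
```

An empty `overlap` confirms the split is leak-free at the group level. The same check is worth running on any custom splitting code.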
Is the target imbalanced?
Minority class <10% of total.
Yes -> Stratified split: Preserve class ratios across folds.
Use StratifiedKFold or StratifiedShuffleSplit. If the data is also grouped, use StratifiedGroupKFold.
No -> Simple random split: Standard train_test_split with fixed seed.
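To see what "preserve class ratios" means in practice, the sketch below builds a hypothetical label vector with a 5% minority class and measures the positive rate in each validation fold. The data and the fold count are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 50 positives out of 1,000 (5% minority)
y = np.array([1] * 50 + [0] * 950)
X = np.zeros((len(y), 1))  # placeholder features; only y drives stratification

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Fraction of positives in each validation fold
val_positive_rates = [y[val_idx].mean() for _, val_idx in skf.split(X, y)]
```

Each fold should carry roughly the same share of positives as the full dataset. A plain KFold gives no such guarantee and can leave a fold with very few minority examples.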
Is the dataset small?
Less than 5,000 rows.
Yes -> Cross-validation: Use 5-fold or 10-fold CV instead of a single holdout. Report mean and std of metrics.
No -> Single holdout is fine (70/15/15 or 80/10/10 train/validation/test).
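Reporting the mean and standard deviation across folds might look like the following. The synthetic dataset, the logistic regression model, and the accuracy metric are stand-ins chosen for the sketch, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic small dataset standing in for real data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One score per fold
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

# Report both the central estimate and its spread across folds
summary = f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}"
print(summary)
```

The standard deviation shows how much the estimate depends on which rows land in which fold, and a single holdout hides that.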
| Dataset Size (rows) | Recommended Split | Notes |
|---|---|---|
| <1,000 | Leave-one-out or 10-fold CV | Every data point matters |
| 1,000-10,000 | 5-fold CV or 80/10/10 | CV preferred for reliable estimates |
| 10,000-100,000 | 80/10/10 | Single holdout usually sufficient |
| >100,000 | 90/5/5 or 95/2.5/2.5 | Large test sets are unnecessary |
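train_test_split produces only two partitions per call, so a three-way split such as 80/10/10 is typically built from two calls. A minimal sketch on made-up data, where the array shapes and the seed are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1,000 rows, binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = rng.integers(0, 2, size=1000)

# First cut: 80% train, 20% held out
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Second cut: halve the held-out 20% into 10% validation and 10% test
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=42
)
```

For 90/5/5, the same pattern applies with `test_size=0.1` on the first call and `test_size=0.5` on the second.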
Set random_state on any shuffled split so results are reproducible.

    from sklearn.model_selection import train_test_split, StratifiedKFold, GroupKFold, TimeSeriesSplit

    # Simple stratified split (X and y are your features and labels)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    # Temporal split: rows must already be sorted by time; X is a pandas DataFrame
    tscv = TimeSeriesSplit(n_splits=5)
    for train_idx, val_idx in tscv.split(X):
        X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]

    # Group-aware split: groups holds one group ID per row
    gkf = GroupKFold(n_splits=5)
    for train_idx, val_idx in gkf.split(X, y, groups=groups):
        X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
npx claudepluginhub andikarachman/data-science-plugin --plugin ds